Sunday, 5 November 2017

Naive Bayes Text classification


Doc -> {+, -}
Documents are represented as a vector (array) of words.
Conditional independence assumption: given the class label, the words are assumed to be independent of each other; no relation between words is modelled.
The probability of a review being positive is proportional to the prior probability of the positive class multiplied by the probability of each word given the positive class, taken over the entire length of the document.

Unique words in the training reviews: I, loved, the, movie, hated, a, great, poor, acting, good [10 unique words]
Involves 3 steps:
1. Convert docs to feature sets
2. Find probabilities of outcomes
3. Classifying new sentences

 Convert docs to feature sets

Attributes: all possible words
Values: No. of times the word occurs in the doc

Find probabilities of outcomes

P(+)=3/5=0.6
No. of words in + case (n)=14
No. of times word k occurs in these + docs = nk
P(wk|+)=(nk+1)/(n+|vocabulary|)   [Laplace/add-one smoothing]
P(I|+)=(1+1)/(14+10)=0.0833
P(loved|+)=(1+1)/(14+10)=0.0833
P(the|+)=(1+1)/(14+10)=0.0833
P(movie|+)=(4+1)/(14+10)=0.2083
P(hated|+)=(0+1)/(14+10)=0.0417
P(a|+)=(2+1)/(14+10)=0.125
P(great|+)=(2+1)/(14+10)=0.125
P(poor|+)=(0+1)/(14+10)=0.0417
P(acting|+)=(1+1)/(14+10)=0.0833
P(good|+)=(2+1)/(14+10)=0.125
Docs with -ve outcomes
P(-)=2/5=0.4
No. of words in - case (n)=6
P(I|-)=(1+1)/(6+10)=0.125
P(loved|-)=(0+1)/(6+10)=0.0625
P(the|-)=(1+1)/(6+10)=0.125
P(movie|-)=(1+1)/(6+10)=0.125
P(hated|-)=(1+1)/(6+10)=0.125
P(a|-)=(0+1)/(6+10)=0.0625
P(great|-)=(0+1)/(6+10)=0.0625
P(poor|-)=(1+1)/(6+10)=0.125
P(acting|-)=(1+1)/(6+10)=0.125
P(good|-)=(0+1)/(6+10)=0.0625

Classifying new sentence

Eg: I hated the poor acting
Probability of sentence being positive,
P(+) × P(I|+) × P(hated|+) × P(the|+) × P(poor|+) × P(acting|+)
= 0.6 × 0.0833 × 0.0417 × 0.0833 × 0.0417 × 0.0833 ≈ 6.0 × 10^-7
Probability of sentence being negative,
P(-) × P(I|-) × P(hated|-) × P(the|-) × P(poor|-) × P(acting|-)
= 0.4 × 0.125 × 0.125 × 0.125 × 0.125 × 0.125 ≈ 1.22 × 10^-5
So the sentence is classified as negative.
If a word in the new sentence is not present in the vocabulary, a very tiny probability is assigned to that word.
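The calculation above can be reproduced in a few lines of Scala. This is only a minimal sketch: the five training reviews below are assumed (the post does not list them) and were chosen so that the word counts match the numbers worked out above.

object NaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val training = Seq(                              // (document, class label) - assumed toy data
      ("I loved the movie", "+"),
      ("a great movie good movie", "+"),
      ("great acting a good movie", "+"),
      ("I hated the movie", "-"),
      ("poor acting", "-"))

    val vocab = training.flatMap(_._1.split(" ")).distinct      // the 10 unique words

    val byClass = training.groupBy(_._2)
    val priors  = byClass.map { case (c, docs) => c -> docs.size.toDouble / training.size }

    // P(w|c) = (count(w,c) + 1) / (n_c + |vocabulary|), i.e. add-one smoothing
    val likelihood = byClass.map { case (c, docs) =>
      val words = docs.flatMap(_._1.split(" "))
      c -> vocab.map(w => w -> (words.count(_ == w) + 1.0) / (words.size + vocab.size)).toMap
    }

    // Classify by comparing log P(c) + sum of log P(w|c); words outside the
    // vocabulary get a very tiny fallback probability.
    def classify(sentence: String): String =
      priors.keys.maxBy { c =>
        math.log(priors(c)) +
          sentence.split(" ").map(w => math.log(likelihood(c).getOrElse(w, 1e-6))).sum
      }

    println(classify("I hated the poor acting"))     // prints "-"
  }
}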

A calm and modest life brings more happiness than the pursuit of success combined with constant restlessness.

Sunday, 29 October 2017

Spark on MongoDB: Part- 4

Configuring Spark with Mongo DB

Two packaged connectors are available to integrate Mongo DB and Spark:
  • Spark Mongo DB connector developed by Stratio: Spark-MongoDB 
  • Mongo DB connector for Spark: mongo-spark
Dependencies required are:
  • Mongo DB Connector for Spark: mongo-spark-connector_2.10
  • Mongo DB Java Driver
  • Spark dependencies

Sample Code

MongoConfig.scala
trait MongoConfig {
  val database = "test"
  val collection = "characters"
  val host = "localhost"
  val port = 27017
}
DataModeler.scala
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.SQLContext
import com.mongodb.spark.config._
import com.mongodb.spark.api.java.MongoSpark
import com.mongodb.spark._
import com.mongodb.spark.sql._

object DataModeler extends MongoConfig {

  def main(args: Array[String]): Unit = {

    println("Initializing Streaming Spark Context...")

    System.setProperty("hadoop.home.dir", """C:\hadoop-2.3.0""")

    val inputUri = "mongodb://" + host + ":" + port + "/" + database + "." + collection
    val outputUri = "mongodb://" + host + ":" + port + "/" + database + "." + collection

    val sparkConf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("MongoConnector")
      .set("spark.mongodb.input.uri", inputUri)
      .set("spark.mongodb.output.uri", outputUri)
    println("SparkConf is set...")

    val sc = new SparkContext(sparkConf)

    val dataHandler = new DataHandler()

    // Data insertion
    dataHandler.insertData(sc)

    // Helper functions
    val sqlContext = SQLContext.getOrCreate(sc)
    val df = MongoSpark.load(sqlContext)

    val rows = df.count()

    println("Is collection empty- " + df.isEmpty())
    println("Number of rows- " + rows)
    println("First row- " + df.first())

    // Data selection
    println("The elements are:")
    sc.loadFromMongoDB().take(rows.toInt).foreach(println)

    // Querying the collection using SQLContext
    val characterdf = sqlContext.read.mongo()
    characterdf.registerTempTable("characters")

    val centenarians = sqlContext.sql("SELECT name, age FROM characters WHERE age >= 100")

    println("Centenarians are:")
    centenarians.show()

    // Applying an aggregation pipeline
    println("Sorted values are:")
    sqlContext.read.option("pipeline", "[{ $sort: { age : -1 } }]").mongo().show()

    // Drop the database
    //MongoConnector(sc).withDatabaseDo(ReadConfig(sc), db => db.drop())

    println("Completed.....")
  }
}
DataHandler.scala
import org.apache.spark.SparkContext
import com.mongodb.spark.api.java.MongoSpark
import org.bson.Document

class DataHandler {
  def insertData(sc: SparkContext) {
    val docs = """
      |{"name": "Bilbo Baggins", "age": 50}
      |{"name": "Gandalf", "age": 1000}
      |{"name": "Thorin", "age": 195}
      |{"name": "Balin", "age": 178}
      |{"name": "Kíli", "age": 77}
      |{"name": "Dwalin", "age": 169}
      |{"name": "Óin", "age": 167}
      |{"name": "Glóin", "age": 158}
      |{"name": "Fíli", "age": 82}
      |{"name": "Bombur"}""".trim.stripMargin.split("[\\r\\n]+").toSeq
    MongoSpark.save(sc.parallelize(docs.map(Document.parse)))
  }
}

Dependencies

spark-assembly-1.6.1-hadoop2.6.0.jar
mongo-java-driver-3.2.2.jar
mongo-spark-connector_2.10-0.1.jar
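
If the project is built with sbt rather than by adding the jars to the classpath by hand, roughly the following build.sbt should pull in equivalent dependencies. The group IDs are assumptions based on the public Maven coordinates of these artifacts, so verify them against the jar versions listed above.

// build.sbt (assumed sbt equivalent of the jars listed above)
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"            % "1.6.1",
  "org.apache.spark"  %% "spark-sql"             % "1.6.1",
  "org.mongodb.spark" %% "mongo-spark-connector" % "0.1",
  "org.mongodb"       %  "mongo-java-driver"     % "3.2.2"
)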

References

  • https://docs.mongodb.com/manual/ 
  • https://docs.mongodb.com/spark-connector 
  • https://github.com/mongodb/mongo-spark 
  • https://github.com/Stratio/Spark-MongoDB 
  • https://www.mongodb.com/mongodb-architecture 
  • https://jira.mongodb.org/secure/attachment/112939/MongoDB_Architecture_Guide.pdf 
Be self motivated, We are the creators of our destiny



Tuesday, 24 October 2017

Spark on MongoDB: Part- 3

Terminologies



Mongo DB Processes and Configurations


  • mongod – the database instance
  • mongos – the sharding process; analogous to a database router. It processes all requests, decides how many and which mongods should receive each query, collates the results, and sends them back to the client.
  • mongo – an interactive shell (a client); a fully functional JavaScript environment for use with a MongoDB deployment.

Getting started with Mongo DB


  1. To install mongo DB, go to this link and click on the appropriate OS and architecture: http://www.mongodb.org/downloads 
  2. Extract the files
  3. Create a data directory for mongoDB to use
  4. Open the mongodb/bin directory and run mongod.exe to start the database server.
  5. To establish a connection to the server, open another command prompt window, go to the same directory, and run mongo.exe. 

Mongo DB CRUD Operation

1. Create 

db.collection.insert( <document> ) 
Eg: db.users.insert(
{ name: "sally", salary: 15000, designation: "MTS", teams: [ "cluster-management" ] }
)
db.collection.save( <document> ) 
2. Read 
db.collection.find( <query>, <projection> ) 
Eg: db.collection.find(
{ qty: { $gt: 4 } },
{ name: 1 }
)
db.collection.findOne( <query>, <projection> ) 
3. Update 
db.collection.update( <Update Criteria>, <Update Action>, < Update Option> ) 
Eg: db.user.update( 
{salary:{$gt:18000}}, 
{$set: {designation: "Manager"}}, 
{multi: true} 
)
4. Delete 
db.collection.remove( <query>, <justOne> ) 
Eg: db.user.remove(
{ “name" : “sally" },
{justOne: true} 

)
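The same CRUD operations can also be issued from application code. Below is a minimal sketch using the MongoDB Java driver (the mongo-java-driver 3.x listed in the Part 4 post) from Scala; the database, collection, and field values are only illustrative.

import com.mongodb.MongoClient
import org.bson.Document

object CrudExample {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient("localhost", 27017)
    val users  = client.getDatabase("test").getCollection("users")

    // Create
    users.insertOne(new Document("name", "sally")
      .append("salary", 15000)
      .append("designation", "MTS"))

    // Read: find documents with salary > 10000
    val cursor = users.find(Document.parse("""{ "salary": { "$gt": 10000 } }""")).iterator()
    while (cursor.hasNext) println(cursor.next().toJson)

    // Update: promote everyone earning more than 18000
    users.updateMany(Document.parse("""{ "salary": { "$gt": 18000 } }"""),
      Document.parse("""{ "$set": { "designation": "Manager" } }"""))

    // Delete
    users.deleteOne(Document.parse("""{ "name": "sally" }"""))

    client.close()
  }
}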
When to Use Spark with MongoDB

While MongoDB natively offers rich analytics capabilities, there are situations where integrating the Spark engine can extend the real-time processing of operational data managed by MongoDB, and allow users to operationalize results generated from Spark within real-time business processes supported by MongoDB.
Spark can take advantage of MongoDB’s rich secondary indexes to extract and process only the range of data it needs, for example analyzing all customers located in a specific geography. This is very different from other databases that either do not offer, or do not recommend the use of, secondary indexes. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for the analyst.
Examples of where it is useful to combine Spark and MongoDB include the following.
1. Rich Operators & Algorithms
Spark supports over 100 different operators and algorithms for processing data. Developers can use this to perform advanced computations that would otherwise require more programmatic effort combining the MongoDB aggregation framework with application code. For example, Spark offers native support for advanced machine learning algorithms including k-means clustering and Gaussian mixture models.
Consider a web analytics platform that uses the MongoDB aggregation framework to maintain a real time dashboard displaying the number of clicks on an article by country; how often the article is shared across social media; and the number of shares by platform. With this data, analysts can quickly gain insight on how content is performing, optimizing user’s experience for posts that are trending, with the ability to deliver critical feedback to the editors and ad-tech team.
Spark’s machine learning algorithms can also be applied to the log, clickstream and user data stored in MongoDB to build precisely targeted content recommendations for its readers. Multi-class classifications are run to divide articles into granular sub-categories, before applying logistic regression and decision tree methods to match readers’ interests to specific articles. The recommendations are then served back to users through MongoDB, as they browse the site.
2. Processing Paradigm
Many programming languages can use their own MongoDB drivers to execute queries against the database, returning results to the application where additional analytics can be run using standard machine learning and statistics libraries. For example, a developer could use the MongoDB Python or R drivers to query the database, loading the result sets into the application tier for additional processing.
However, this starts to become more complex when an analytical job in the application needs to be distributed across multiple threads and nodes. While MongoDB can service thousands of connections in parallel, the application would need to partition the data, distribute the processing across the cluster, and then merge results. Spark makes this kind of distributed processing easier and faster to develop. MongoDB exposes operational data to Spark’s distributed processing layer to provide fast, real-time analysis. Combining Spark queries with MongoDB indexes allows data to be filtered, avoiding full collection scans and delivering low-latency responsiveness with minimal hardware and database overhead.
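To make this point concrete, here is a small hedged sketch: with the mongo-spark connector configured as in the Part 4 post, a $match stage can be pushed into MongoDB's aggregation pipeline so that only the needed documents (which an index can serve) are handed to Spark, rather than scanning the whole collection. It assumes the connector's Scala helpers MongoSpark.load and MongoRDD.withPipeline, and a SparkContext sc that already has spark.mongodb.input.uri set; the field and values are illustrative.

import com.mongodb.spark.MongoSpark
import org.bson.Document

// sc: an already-configured SparkContext (see the setup in Part 4)
val rdd = MongoSpark.load(sc)
val filtered = rdd.withPipeline(Seq(Document.parse("""{ "$match": { "age": { "$gte": 100 } } }""")))
println("Matching documents: " + filtered.count())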
3. Skills Re-Use
With libraries for SQL, machine learning and more, combined with programming in Java, Scala and Python, developers can leverage existing skills and best practices to build sophisticated analytics workflows on top of MongoDB.

Don't fear failure in the first attempt because even the successful maths started with 'zero' only

Sunday, 22 October 2017

Spark on MongoDB: Part- 2

MongoDB data management

1. Sharding

MongoDB provides horizontal scale-out for databases on low cost, commodity hardware or cloud infrastructure using a technique called sharding, which is transparent to applications. Sharding distributes data across multiple physical partitions called shards. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application. MongoDB automatically balances the data in the sharded cluster as the data grows or the size of the cluster increases or decreases.

  • Range-based Sharding: Documents are partitioned across shards according to the shard key value. Documents with shard key values “close” to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range-based queries. 
  • Hash-based Sharding: Documents are uniformly distributed according to an MD5 hash of the shard key value. This approach guarantees a uniform distribution of writes across shards, but is less optimal for range-based queries. 
  • Location-aware Sharding: Documents are partitioned according to a user-specified configuration that associates shard key ranges with specific shards and hardware. Users can continuously refine the physical location of documents for application requirements such as locating data in specific data centers or on different storage types (i.e. the In-Memory engine for “hot” data, and disk-based engines running on SSDs for warm data, and HDDs for aged data).
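
As a rough illustration of how a collection is actually sharded (run against a mongos, not a standalone mongod), the following sketch issues the enableSharding and shardCollection admin commands through the Java driver from Scala. The database, collection, and shard key names are illustrative, and the same commands can equally be run from the mongo shell.

import com.mongodb.MongoClient
import org.bson.Document

object ShardingSetup {
  def main(args: Array[String]): Unit = {
    val mongos = new MongoClient("localhost", 27017)   // connection must point at a mongos router
    val admin  = mongos.getDatabase("admin")

    // Enable sharding for the database, then shard the collection on a hashed key
    admin.runCommand(Document.parse("""{ "enableSharding": "test" }"""))
    admin.runCommand(Document.parse("""{ "shardCollection": "test.users", "key": { "userId": "hashed" } }"""))

    mongos.close()
  }
}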

2. Consistency

MongoDB is ACID compliant at the document level. One or more fields may be written in a single operation, including updates to multiple sub-documents and elements of an array. The ACID property provided by MongoDB ensures complete isolation as a document is updated; any errors cause the operation to roll back and clients receive a consistent view of the document. MongoDB also allows users to specify write availability in the system using an option called the write concern. The default write concern acknowledges writes from the application, allowing the client to catch network exceptions and duplicate key errors. Developers can use MongoDB's Write Concerns to configure operations to commit to the application only after specific policies have been fulfilled. Each write operation can specify the appropriate write concern, ranging from unacknowledged to acknowledgement that writes have been committed to all replicas.
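
For example, a majority write concern can be set per collection (or per operation) through the driver. A minimal sketch with the Java driver from Scala, where the database, collection, and document values are illustrative:

import com.mongodb.{MongoClient, WriteConcern}
import org.bson.Document

object WriteConcernExample {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient("localhost", 27017)
    val users  = client.getDatabase("test")
      .getCollection("users")
      .withWriteConcern(WriteConcern.MAJORITY)   // acknowledged once a majority of replica set members have the write

    users.insertOne(new Document("name", "sally"))
    client.close()
  }
}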

3. Availability

Replication: MongoDB maintains multiple copies of data called replica sets using native replication. A replica set is a fully self-healing shard that helps prevent database downtime and can be used to scale read operations. Replica failover is fully automated, eliminating the need for administrators to intervene manually. A replica set consists of multiple replicas. At any given time one member acts as the primary replica set member and the other members act as secondary replica set members. MongoDB is strongly consistent by default: reads and writes are issued to a primary copy of the data. If the primary member fails for any reason (e.g., hardware failure, network partition) one of the secondary members is automatically elected to primary and begins to process all writes. Up to 50 members can be provisioned per replica set.
Applications can optionally read from secondary replicas, where data is eventually consistent by default. Reads from secondaries can be useful in scenarios where it is acceptable for data to be slightly out of date, such as some reporting applications. For data-center aware reads, applications can also read from the closest copy of the data as measured by ping distance to reduce the effects of geographic latency. Replica sets also provide operational flexibility by providing a way to upgrade hardware and software without requiring the database to go offline. MongoDB achieves replication through replica sets. A replica set is a group of mongod instances that host the same data set. In a replica set, one node is the primary node and receives all write operations. All other instances apply operations from the primary so that they have the same data set. A replica set can have only one primary node.
  1. A replica set is a group of two or more nodes (generally a minimum of 3 nodes is required).
  2. In a replica set, one node is the primary node and the remaining nodes are secondary.
  3. All data replicates from the primary to the secondary nodes.
  4. At the time of automatic failover or maintenance, an election is held and a new primary node is elected.
  5. After the recovery of a failed node, it joins the replica set again and works as a secondary node. 

Client applications always interact with the primary node, and the primary node then replicates the data to the secondary nodes.
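The read behaviour described above is controlled through a read preference. A minimal sketch with the Java driver from Scala follows; the replica set hosts, names, and query are assumptions, and reads served by secondaries are eventually consistent.

import com.mongodb.{MongoClient, MongoClientURI, ReadPreference}
import org.bson.Document

object ReadFromSecondaries {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient(
      new MongoClientURI("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"))

    val users = client.getDatabase("test")
      .getCollection("users")
      .withReadPreference(ReadPreference.secondaryPreferred())   // prefer secondaries, fall back to the primary

    val it = users.find(new Document("city", "Kochi")).iterator()
    while (it.hasNext) println(it.next().toJson)
    client.close()
  }
}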
Replica set oplog: Operations that modify a database on the primary replica set member are replicated to the secondary members using the oplog (operations log). The size of the oplog is configurable and by default is 5% of the available free disk space. For most applications, this size represents many hours of operations and defines the recovery window for a secondary, should this replica go offline for some period of time and need to catch up to the primary when it recovers. If a secondary replica set member is down for a period longer than is maintained by the oplog, it must be recovered from the primary replica using a process called initial synchronization. During this process all databases and their collections are copied from the primary or another replica to the secondary, as well as the oplog, and then the indexes are built.
Elections and failover: Replica sets reduce operational overhead and improve system availability. If the primary replica for a shard fails, secondary replicas together determine which replica should be elected as the new primary using an extended implementation of the Raft consensus algorithm. Once the election process has determined the new primary, the secondary members automatically start replicating from it. If the original primary comes back online, it will recognize its change in state and automatically assume the role of a secondary.
Election Priority: A new primary replica set member is promoted within several seconds of the original primary failing. During this time, queries configured with the appropriate read preference can continue to be serviced by secondary replica set members. The election algorithms evaluate a range of parameters including analysis of election identifiers and timestamps to identify the replica set members that have applied the most recent updates from the primary; heartbeat and connectivity status; and user-defined priorities assigned to replica set members. In an election, the replica set elects an eligible member with the highest priority value as primary. By default, all members have a priority of 1 and have an equal chance of becoming primary; however, it is possible to set priority values that affect the likelihood of a replica becoming primary.

4. Storage

With MongoDB’s flexible storage architecture, the database automatically manages the movement of data between storage engine technologies using native replication. This approach significantly reduces developer and operational complexity when compared to running multiple distinct database technologies. MongoDB 3.2 ships with four supported storage engines, all of which can coexist within a single MongoDB replica set.

  • The default WiredTiger storage engine. For many applications, WiredTiger's granular concurrency control and native compression will provide the best all round performance and storage efficiency for the broadest range of applications. 
  • The Encrypted storage engine, protecting highly sensitive data without the performance or management overhead of separate file system encryption. (Requires MongoDB Enterprise Advanced.)
  • The In-Memory storage engine, delivering extreme performance coupled with real-time analytics for the most demanding, latency-sensitive applications. (Requires MongoDB Enterprise Advanced.)
  • The MMAPv1 engine, an improved version of the storage engine used in pre-3.x MongoDB releases.

  • Administrators can modify the default compression settings for all collections and indexes.
5. Security


MongoDB Enterprise Advanced features extensive capabilities to defend, detect, and control access to data:

  • Authentication: Simplifying access control to the database, MongoDB offers integration with external security mechanisms including LDAP, Windows Active Directory, Kerberos, and x.509 certificates. 
  • Authorization: User-defined roles enable administrators to configure granular permissions for a user or an application based on the privileges they need to do their job. Additionally, field-level redaction can work with trusted middleware to manage access to individual fields within a document, making it possible to co-locate data with multiple security levels in a single document. 
  • Auditing: For regulatory compliance, security administrators can use MongoDB's native audit log to track any operation taken against the database – whether DML, DCL or DDL. 
  • Encryption: MongoDB data can be encrypted on the network and on disk. With the introduction of the Encrypted storage engine, protection of data at-rest now becomes an integral feature within the database. By natively encrypting database files on disk, administrators eliminate both the management and performance overhead of external encryption mechanisms. Now, only those staff who have the appropriate database authorization credentials can access the encrypted data, providing additional levels of defence.


Chase your Dreams

Wednesday, 22 March 2017

Block chain in bits & pieces


A blockchain system is a package which contains a normal database plus some software that adds new rows, validates that new rows conform to pre-agreed rules, and listens for and broadcasts new rows to its peers across a network, ensuring that all peers have the same data in their databases. The Bitcoin Blockchain ecosystem is actually quite a complex system due to its dual aims: that anyone should be able to write to the Bitcoin Blockchain, and that there shouldn’t be any centralised power or control.

Bitcoin

A 2008 whitepaper entitled "Bitcoin: A Peer-to-Peer Electronic Cash System", written by Satoshi Nakamoto, introduced the concept of bitcoin. It is described as "a purely peer-to-peer version of electronic cash [which] would allow online payments to be sent directly from one party to another without going through a financial institution". It can be thought of as an international currency that can be used for transactions on the internet. As an electronic asset, we can buy bitcoins, own them, and send them to someone else. Transactions of bitcoins from account to account are recognised globally in a matter of seconds, and can usually be considered securely settled within an hour. They have a price, and the price is set by normal supply and demand market forces in marketplaces where traders come to trade.
So, there is the concept of electronic cash: cash being a bearer asset, like the cash in our pocket which we can spend at will without asking permission from a third party.

How does it work?

A network of computers validates and keeps track of bitcoin payments, and ensures that they are recorded by being added to an ever-growing list of all the bitcoin payments that have been made.
When we make a bitcoin payment, a payment instruction is sent to the network.  The computers on the network validate the instruction and relay it to the other computers.  After some time has passed, the payment gets included in one of the block updates, and is added to The Bitcoin Blockchain file on all the computers across the network.
Peer-to-peer:  Peer-to-peer is like a gossip network where everyone tells a few other people the news (about new transactions and new blocks), and eventually the message gets to everyone in the network. One benefit of peer-to-peer (p2p) over client-server is that with p2p, the network doesn’t rely on one central point of control which can fail.

How are bitcoins stored?

Bitcoin ownership is tracked on The Bitcoin Blockchain, and bitcoins are associated with “bitcoin addresses”. Bitcoins themselves are not stored; rather, the keys or passwords needed to make payments are stored in “wallets”, which are apps that manage the addresses, keys, balances, and payments. 
Bitcoin addresses: A bitcoin address is similar to a bank account number.
Bitcoin wallets: Bitcoin wallets are apps that display all of our bitcoin addresses, display balances and make it easy to send and receive payments. For a wallet to provide accurate information, it needs to be online or connected to a Bitcoin Blockchain file, which it uses as its source of information. The wallet will read the Bitcoin Blockchain file and calculate the balances in each address.

How are bitcoins sent?

Payments, or bitcoin transactions
Each bitcoin address has its own private key, which is needed to send payments from that address. Whoever knows this private key can make payments from the address. Wallet software is used to manage this private key.
Private key: Because we cannot change that private key to something more memorable, it can be a pain to remember. Most wallet apps will encrypt that key with a password that we choose. Later, when we want to make a payment, we just need to remember our password. Bitcoin wallets don’t store bitcoins but store the keys that let us transfer or ‘spend’ them.

What happens when I make a bitcoin payment?

A payment is an instruction to unlink some bitcoins from an address we control, and move them to the control of another address (the recipient’s). Our payment instruction includes:

  • which bitcoins we’re sending
  • which address we’re sending them from
  • which address we’re sending them to

Digital cryptographic signatures: The instruction is then digitally signed with the private key of the address which currently holds the bitcoins. This digital signing demonstrates that we are the owner of the address in question.
Validators: When the first computer receives the instruction, it checks some technical details and some business logic details. The same tests are done by all computers in the network. Eventually all computers on the network know about this payment, and it appears on screens everywhere in the world as an “unconfirmed transaction”. It is unconfirmed because although the payment has been verified and passed around, it isn’t entered into the ledger yet.

How are bitcoins tracked?

Specialised nodes in the network work to add the bitcoin transactions, in blocks, to the blockchain. This is known as bitcoin mining. Mining is a guessing game where your chance of winning is related to how quickly your machine can perform calculations compared to how quickly other miners are performing similar calculations. Whoever guesses the right number first wins the right to add a new block of transactions to everyone’s blockchains, and does this by publishing it to the other computers on the network. Each computer performs a quick validation of the block, and if they agree that the block and transactions conform to the rules, they add the block to their own blockchain.
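
The guessing game can be illustrated with a toy proof-of-work sketch in Scala. This is only an illustration of the idea (real Bitcoin blocks hash a binary block header, and the difficulty target is not simply a count of leading zero characters): we search for a nonce whose SHA-256 hash of the block data starts with a given number of zeros.

import java.security.MessageDigest

object ToyMining {
  def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  def mine(blockData: String, difficulty: Int): Long = {
    val target = "0" * difficulty
    var nonce = 0L
    while (!sha256Hex(blockData + nonce).startsWith(target)) nonce += 1   // the "guessing"
    nonce
  }

  def main(args: Array[String]): Unit = {
    val data  = "block: Alice pays Bob 1 BTC;"       // made-up block contents
    val nonce = mine(data, difficulty = 4)
    println(s"Found nonce $nonce -> hash ${sha256Hex(data + nonce)}")
  }
}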

Bitcoin security

Making payments: Bitcoin private keys are used to make payments.
Block control: There are two parts to this.  Firstly there is block-creation (“mining”), performed by some specialised nodes; secondly there is block validation, which is performed by all nodes.

References
https://bitsonblocks.net/2015/09/09/a-gentle-introduction-to-blockchain-technology/
https://bitsonblocks.net/2015/09/01/a-gentle-introduction-to-bitcoin/
https://bitcoin.org/bitcoin.pdf

It only takes a split second to smile, yet to someone that needed it, it can last a lifetime.

Saturday, 7 January 2017

Useful Queries in Neo4j

1. Create nodes with label Employee and Department
CREATE (emp:Employee{id:"1001",name:"Tom",dob:"01/10/1982"})
CREATE (dept:Department { deptno:10,dname:"Accounting",location:"Hyderabad" })

2. Create node and relationship
CREATE (emp:Employee {id:"1001"})-[:WORKS_AT {since:"2 days"}]->(dept:Department { deptno:10 }) 

3. Selection
MATCH (n) RETURN n : returns all nodes and relationships
MATCH (emp:Employee{id:"1001"}) RETURN emp : returns the nodes with id as 1001.
MATCH (emp:Employee) WHERE emp.id="1001" RETURN emp : returns the same result as the previous query, but is not preferred due to performance impacts

4. Limiting results
MATCH (n) RETURN n LIMIT 10

5. Deletion
MATCH (n) DELETE n

6. Delete nodes and relationships (a node that has relationships can only be deleted if all the relationships associated with it are deleted too)
MATCH (n) DETACH DELETE n

7. Return attribute names of a node
MATCH (n) RETURN keys(n)

8. Get attributes along with values
MATCH (n) RETURN properties(n)

9. Regular expressions
MATCH (emp:Employee) 
WHERE emp.id=~'\\d+'
RETURN (emp)

10. Get the current timestamp (note: timestamp() returns the current transaction time in milliseconds; Neo4j does not store a node's creation time)
MATCH (n) RETURN timestamp()

11. Find all nodes that have a relationship with the designated label, and RETURN selected properties of the nodes and the relationships
MATCH (n)-[r:WORKS_AT]->()
RETURN n,r.since

12. Return distinct in neo4j
MATCH (n1)-[r:WORKS_AT]->(n2)
RETURN DISTINCT r.since

13. Aggregation
MATCH (n1)-[r:WORKS_AT]->(n2)
RETURN r.since,COUNT(*)

14. Create-Update-Delete properties
MERGE (emp:Employee {id:"1001"}) ON CREATE SET emp.city = "Kochi", emp.dob = "01/10/1982" ON MATCH SET emp.name = "Michael" REMOVE emp.id

15. Return nodes that have any relationship
MATCH (n)--(m) RETURN n

16. Find count of relationship in both directions
MATCH (n)
RETURN n.name,size((n)-->()) as outcount, size((n)<--()) as incount
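
These queries can also be executed from application code. Below is a minimal sketch using the official Neo4j Java driver (Bolt protocol, 1.x API) from Scala; the bolt URL and credentials are assumptions for a local installation.

import org.neo4j.driver.v1.{AuthTokens, GraphDatabase}

object EmployeeQueries {
  def main(args: Array[String]): Unit = {
    val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
    val session = driver.session()

    // Query 3 above: return the employee with id 1001
    val result = session.run("MATCH (emp:Employee {id:\"1001\"}) RETURN emp.name AS name")
    while (result.hasNext) println(result.next().get("name").asString())

    session.close()
    driver.close()
  }
}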



The best preparation for tomorrow is doing best today

Monday, 31 October 2016

Spark on MongoDB- Part: 1

MongoDB
MongoDB is a document-oriented database. It was developed by MongoDB Inc. (formerly 10gen) in 2007. It has drivers for various programming languages such as JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, and Erlang. It supports built-in horizontal scaling through sharding, which divides the dataset and distributes it over multiple servers. MongoDB supports embedded documents, which eliminates the need for complex joins, and it also provides full index support and high availability through replication.
Nexus Architecture
MongoDB’s design philosophy is focused on combining the critical capabilities of relational databases with the innovations of NoSQL technologies. It adopts the following features from relational databases:

  • Expressive query language & secondary Indexes. Users should be able to access and manipulate their data in sophisticated ways to support both operational and analytical applications. Indexes play a critical role in providing efficient access to data, supported natively by the database rather than maintained in application code.
  • Strong consistency. Applications should be able to immediately read what has been written to the database.
  • Enterprise Management and Integrations. Databases are just one piece of application infrastructure, and need to fit seamlessly into the enterprise IT stack. Organizations need a database that can be secured, monitored, automated, and integrated with their existing technology infrastructure, processes, and staff, including operations teams, DBAs, and data analysts.
However, modern applications impose requirements not addressed by relational databases, and this has driven the development of NoSQL databases which offer: 
  • Flexible Data Model. Whether document, graph, key-value, or wide-column, all of them offer a flexible data model, making it easy to store and combine data of any structure and allow dynamic modification of the schema without downtime or performance impact. 
  • Scalability and Performance. NoSQL databases were all built with a focus on scalability, so they all include some form of sharding or partitioning. This allows the database to scale out on commodity hardware deployed on-premises or in the cloud, enabling almost unlimited growth with higher throughput and lower latency than relational databases. 
  • Always-On Global Deployments. NoSQL databases are designed for highly available systems that provide a consistent, high quality experience for users all over the world. They are designed to run across many nodes, including replication to automatically synchronize data across servers, racks, and data centers.

With its Nexus Architecture, MongoDB is the only database that harnesses the innovations of NoSQL while maintaining the foundation of relational databases.
Data Model
MongoDB stores data as documents in a binary representation called BSON (Binary JSON). The BSON encoding extends the popular JSON (JavaScript Object Notation) representation to include additional types such as int, long, date, and floating point. BSON documents contain one or more fields, and each field contains a value of a specific data type, including arrays, binary data and sub-documents.
For example, consider the data model for a blogging application. In a relational database the data model would comprise multiple tables. To simplify the example, assume there are tables for Categories, Tags, Users, Comments and Articles. In MongoDB the data could be modeled as two collections, one for users, and the other for articles. In each blog document there might be multiple comments, multiple tags, and multiple categories, each expressed as an embedded array.
MongoDB documents tend to have all data for a given record in a single document, whereas in a relational database information for a given record is usually spread across many tables. With the MongoDB document model, data is more localized, which significantly reduces the need to JOIN separate tables. The result is dramatically higher performance and scalability across commodity hardware as a single read to the database can retrieve the entire document containing all related data. Unlike many NoSQL databases, users don’t need to give up JOINs entirely. For additional analytics flexibility, MongoDB preserves left-outer JOIN semantics with the $lookup operator, enabling users to get the best of both relational and non-relational data modeling. MongoDB BSON documents are closely aligned to the structure of objects in the programming language. This makes it simpler and faster for developers to model how data in the application will map to data stored in the database.
Fields can vary from document to document; there is no need to declare the structure of documents to the system – documents are self-describing. If a new field needs to be added to a document then the field can be created without affecting all other documents in the system, without updating a central system catalog, and without taking the system offline.
Unlike many NoSQL databases, MongoDB is not limited to simple key-value operations. A key element of this flexibility is MongoDB's support for many types of queries. A query may return a document, a subset of specific fields within the document, or complex aggregations against many documents: 
  • Key-value queries return results based on any field in the document, often the primary key. 
  • Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
  • Geospatial queries return results based on proximity criteria, intersection and inclusion as specified by a point, line, circle or polygon. 
  • Text Search queries return results in relevance order based on text arguments using Boolean operators (e.g., AND, OR, NOT). 
  • Aggregation Framework queries return aggregations of values returned by the query (e.g., count, min, max, average, similar to a SQL GROUP BY statement). 
  • MapReduce queries execute complex data processing that is expressed in JavaScript and executed across data in the database.
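
As a small illustration of two of these query types (a range query and an aggregation), here is a hedged sketch using the MongoDB Java driver from Scala; the collection and field names are made up for the blogging example.

import com.mongodb.MongoClient
import org.bson.Document
import java.util.Collections

object QueryTypes {
  def main(args: Array[String]): Unit = {
    val client   = new MongoClient("localhost", 27017)
    val articles = client.getDatabase("blog").getCollection("articles")

    // Range query: articles with more than 100 comments
    val range = articles.find(Document.parse("""{ "commentCount": { "$gt": 100 } }""")).iterator()
    while (range.hasNext) println(range.next().toJson)

    // Aggregation framework: number of articles per category (similar to SQL GROUP BY)
    val pipeline = Collections.singletonList(
      Document.parse("""{ "$group": { "_id": "$category", "count": { "$sum": 1 } } }"""))
    val groups = articles.aggregate(pipeline).iterator()
    while (groups.hasNext) println(groups.next().toJson)

    client.close()
  }
}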
Typical MongoDB Deployment
Applications issue queries to a query router that dispatches the query to the appropriate shards. For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will broadcast the query to all shards, aggregating and sorting the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.

Reference: https://docs.mongodb.com/

You might be wondering why the title of this post is 'Spark on MongoDB' when so far we have discussed only MongoDB. We'll look at more features of MongoDB, and why to use Spark on MongoDB, in the next post.

Every Day is a new Beginning!