Monday 31 October 2016

Spark on MongoDB- Part: 1

MongoDB
MongoDB is document oriented storage. It was developed by MongoDB Inc.(formerly 10 gen) in 2007. It has APIs in various programming languages such as JavaScript, Python, Ruby, Perl, Java, Java, Scala, C#, C++, Haskell, Erlang etc. It supports built in horizontal scaling by dividing the system dataset and loading over multiple servers. MongoDB supports horizontal scaling by sharding. MongoDB supports embedded documents which eliminates the need for complex joins. It also provide full index support and high availability through replication.
Nexus Architecture
MongoDB’s design philosophy is focused on combining the critical capabilities of relational databases with the innovations of NoSQL technologies. It adopts the following features from relational databases:

  • Expressive query language & secondary Indexes. Users should be able to access and manipulate their data in sophisticated ways to support both operational and analytical applications. Indexes play a critical role in providing efficient access to data, supported natively by the database rather than maintained in application code.
  • Strong consistency. Applications should be able to immediately read what has been written to the database.
  • Enterprise Management and Integrations. Databases are just one piece of application infrastructure, and need to fit seamlessly into the enterprise IT stack. Organizations need a database that can be secured, monitored, automated, and integrated with their existing technology infrastructure, processes, and staff, including operations teams, DBAs, and data analysts.
However, modern applications impose requirements not addressed by relational databases, and this has driven the development of NoSQL databases which offer: 
  • Flexible Data Model. Whether document, graph, key-value, or wide-column, all of them offer a flexible data model, making it easy to store and combine data of any structure and allow dynamic modification of the schema without downtime or performance impact. 
  • Scalability and Performance. NoSQL databases were all built with a focus on scalability, so they all include some form of sharding or partitioning. This allows the database to scale out on commodity hardware deployed on-premises or in the cloud, enabling almost unlimited growth with higher throughput and lower latency than relational databases. 
  • Always-On Global Deployments. NoSQL databases are designed for highly available systems that provide a consistent, high quality experience for users all over the world. They are designed to run across many nodes, including replication to automatically synchronize data across servers, racks, and data centers.

With its Nexus Architecture, MongoDB is the only database that harnesses the innovations of NoSQL while maintaining the foundation of relational databases.
Data Model
MongoDB stores data as documents in a binary representation called BSON (Binary JSON). The BSON encoding extends the popular JSON (JavaScript Object Notation) representation to include additional types such as int, long, date, and floating point. BSON documents contain one or more fields, and each field contains a value of a specific data type, including arrays, binary data and sub-documents.
For example, consider the data model for a blogging application. In a relational database the data model would comprise multiple tables. To simplify the example, assume there are tables for Categories, Tags, Users, Comments and Articles. In MongoDB the data could be modeled as two collections, one for users, and the other for articles. In each blog document there might be multiple comments, multiple tags, and multiple categories, each expressed as an embedded array.
MongoDB documents tend to have all data for a given record in a single document, whereas in a relational database information for a given record is usually spread across many tables. With the MongoDB document model, data is more localized, which significantly reduces the need to JOIN separate tables. The result is dramatically higher performance and scalability across commodity hardware as a single read to the database can retrieve the entire document containing all related data. Unlike many NoSQL databases, users don’t need to give up JOINs entirely. For additional analytics flexibility, MongoDB preserves left-outer JOIN semanticswith the $lookup operator, enabling users to get the best of both relational and non-relational data modeling. MongoDB BSON documents are closely aligned to the structure of objects in the programming language. This makes it simpler and faster for developers to model how data in the application will map to data stored in the database.
Fields can vary from document to document; there is no need to declare the structure of documents to the system – documents are self-describing. If a new field needs to be added to a document then the field can be created without affecting all other documents in the system, without updating a central system catalog, and without taking the system offline. The data model is aligned to the structure of objects in the programming language. This makes it simpler and faster for developers to model how data in the application will map to data stored in the database.
Unlike NoSQL  database, MongoDB is not limited to simple Key-Value operations. A key element of this flexibility is MongoDB's support for many types of queries. A query may return a document, a subset of specific fields within the document or complex aggregations against many documents: 
  • Key-value queries return results based on any field in the document, often the primary key. 
  • Range queries return results based on values defined as inequalities (e.g, greater than, less than or equal to, between).
  • Geospatial queries return results based on proximity criteria, intersection and inclusion as specified by a point, line, circle or polygon. 
  • Text Search queries return results in relevance order based on text arguments using Boolean operators (e.g., AND, OR, NOT). 
  • Aggregation Framework queries return aggregations of values returned by the query (e.g., count, min, max, average, similar to a SQL GROUP BY statement). 
  • MapReduce queries execute complex data processing that is expressed in JavaScript and executed across data in the database.
Typical MongoDB Deployment
Applications issue queries to a query router that dispatches the query to the appropriate shards. For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will broadcast the query to all shards, aggregating and sorting the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.

Reference: https://docs.mongodb.com/

All might be wondering the title of this post is 'Spark on MongoDB' and till now we have been discussing only about MongoDB. We 'll see more features of MongoDB and Why use Spark on MongoDB in the next post..

Every Day is a new Beginning!