Tuesday, 24 October 2017

Spark on MongoDB: Part 3

Terminology



MongoDB Processes and Configurations


  • mongod – the database server process (a database instance)
  • mongos – the sharding router process
Analogous to a database router: it processes all client requests,
decides how many and which mongods should receive each query,
then collates the results and sends them back to the client.
  • mongo – an interactive shell (a client)
A fully functional JavaScript environment for use with a MongoDB database.

Getting started with MongoDB


  1. To install MongoDB, go to http://www.mongodb.org/downloads and download the package for the appropriate OS and architecture.
  2. Extract the files.
  3. Create a data directory for MongoDB to use.
  4. Open the mongodb/bin directory and run mongod.exe, pointing it at the data directory, to start the database server.
  5. To establish a connection to the server, open another command prompt window, go to the same directory, and run mongo.exe (see the sample commands below).
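
As a rough illustration, assuming the archive was extracted to C:\mongodb and the data directory was created at C:\data\db (both paths are placeholders), steps 4 and 5 might look like this on Windows:

C:\mongodb\bin> mongod.exe --dbpath C:\data\db
C:\mongodb\bin> mongo.exe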

MongoDB CRUD Operations

1. Create 

db.collection.insert( <document> ) 
Eg: db.users.insert(
{ name: "sally", salary: 15000, designation: "MTS", teams: [ "cluster-management" ] }
)

db.collection.save( <document> ) 
2. Read 
db.collection.find( <query>, <projection> ) 
Eg: db.collection.find(
{ qty: { $gt: 4 } },
{ name: 1 }
)

db.collection.findOne( <query>, <projection> ) 
3. Update 
db.collection.update( <Update Criteria>, <Update Action>, < Update Option> ) 
Eg: db.users.update( 
{salary: {$gt: 18000}}, 
{$set: {designation: "Manager"}}, 
{multi: true} 
)
4. Delete 
db.collection.remove( <query>, <justOne> ) 
Eg: db.users.remove(
{ "name" : "sally" },
{justOne: true} 
)

When to Use Spark with MongoDB

While MongoDB natively offers rich analytics capabilities, there are situations where integrating the Spark engine can extend the real-time processing of operational data managed by MongoDB, and allow users to operationalize results generated from Spark within real-time business processes supported by MongoDB.
Spark can take advantage of MongoDB’s rich secondary indexes to extract and process only the range of data it needs; for example, analyzing all customers located in a specific geography. This is very different from other databases that either do not offer secondary indexes or do not recommend their use. In those cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark job. This means more processing overhead, more hardware, and longer time-to-insight for the analyst.
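
The sketch below illustrates this pattern with PySpark and the MongoDB Connector for Spark (assumed to be available on the classpath, e.g. added via spark-submit --packages); the URI, database, collection, and field names are illustrative placeholders:

from pyspark.sql import SparkSession

# Placeholder connection string; point this at the real operational database.
spark = (SparkSession.builder
    .appName("mongo-geo-analysis")
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/analytics.customers")
    .getOrCreate())

# The connector can push this filter down to MongoDB, so a secondary index on
# "country" limits what is extracted instead of forcing a full collection scan.
customers = (spark.read
    .format("com.mongodb.spark.sql.DefaultSource")
    .load()
    .filter("country = 'IN'"))

customers.groupBy("city").count().show()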
Examples of where it is useful to combine Spark and MongoDB include the following.
1. Rich Operators & Algorithms
Spark supports over 100 different operators and algorithms for processing data. Developers can use these to perform advanced computations that would otherwise require more programmatic effort, combining the MongoDB aggregation framework with application code. For example, Spark offers native support for advanced machine learning algorithms, including k-means clustering and Gaussian mixture models.
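
As a minimal PySpark sketch of k-means with Spark MLlib (the toy rows and column names below are invented stand-ins for fields that would normally be read out of MongoDB):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Toy rows standing in for user metrics pulled from MongoDB.
df = spark.createDataFrame(
    [(1, 12000.0, 3.0), (2, 15000.0, 5.0), (3, 90000.0, 40.0), (4, 88000.0, 35.0)],
    ["user_id", "salary", "visits"])

assembler = VectorAssembler(inputCols=["salary", "visits"], outputCol="features")
model = KMeans(k=2, seed=42).fit(assembler.transform(df))
model.transform(assembler.transform(df)).select("user_id", "prediction").show()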
Consider a web analytics platform that uses the MongoDB aggregation framework to maintain a real-time dashboard displaying the number of clicks on an article by country, how often the article is shared across social media, and the number of shares by platform. With this data, analysts can quickly gain insight into how content is performing, optimize the user experience for posts that are trending, and deliver critical feedback to the editors and ad-tech team.
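
A hedged sketch of the kind of pipeline such a dashboard might run, shown here through the PyMongo driver (the events collection and its field names are assumptions for illustration):

from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["analytics"]["events"]

# Count clicks per country for one article, sorted by click count.
clicks_by_country = list(events.aggregate([
    {"$match": {"type": "click", "article_id": "a123"}},
    {"$group": {"_id": "$country", "clicks": {"$sum": 1}}},
    {"$sort": {"clicks": -1}},
]))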
Spark’s machine learning algorithms can also be applied to the log, clickstream and user data stored in MongoDB to build precisely targeted content recommendations for its readers. Multi-class classifications are run to divide articles into granular sub-categories, before applying logistic regression and decision tree methods to match readers’ interests to specific articles. The recommendations are then served back to users through MongoDB, as they browse the site.
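
As an illustrative sketch only, using Spark MLlib's LogisticRegression (the label and feature columns below are invented stand-ins for features engineered from the clickstream data held in MongoDB):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("article-recs").getOrCreate()

# label = 1.0 if the reader engaged with the article, 0.0 otherwise (toy data).
train = spark.createDataFrame(
    [(1.0, 3.0, 12.0), (0.0, 1.0, 2.0), (1.0, 4.0, 9.0), (0.0, 0.0, 1.0)],
    ["label", "shares", "clicks"])

assembler = VectorAssembler(inputCols=["shares", "clicks"], outputCol="features")
model = LogisticRegression(maxIter=10).fit(assembler.transform(train))
scored = model.transform(assembler.transform(train))  # adds probability and prediction columns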
2. Processing Paradigm
Many programming languages can use their own MongoDB drivers to execute queries against the database, returning results to the application where additional analytics can be run using standard machine learning and statistics libraries. For example, a developer could use the MongoDB Python or R drivers to query the database, loading the result sets into the application tier for additional processing.
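
A minimal sketch of that driver-level pattern with PyMongo, reusing the users collection from the CRUD examples above (the database name and the analysis itself are just stand-ins):

from statistics import mean
from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["test"]["users"]

# Pull only the fields needed, then analyze them in the application tier.
salaries = [doc["salary"] for doc in users.find({"designation": "MTS"}, {"salary": 1})]
print(mean(salaries) if salaries else "no matching documents")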
However, this starts to become more complex when an analytical job in the application needs to be distributed across multiple threads and nodes. While MongoDB can service thousands of connections in parallel, the application would need to partition the data, distribute the processing across the cluster, and then merge results. Spark makes this kind of distributed processing easier and faster to develop. MongoDB exposes operational data to Spark’s distributed processing layer to provide fast, real-time analysis. Combining Spark queries with MongoDB indexes allows data to be filtered, avoiding full collection scans and delivering low-latency responsiveness with minimal hardware and database overhead.
3. Skills Re-Use
With libraries for SQL, machine learning and more, combined with support for programming in Java, Scala and Python, developers can leverage existing skills and best practices to build sophisticated analytics workflows on top of MongoDB.

Don't fear failure in the first attempt because even the successful maths started with 'zero' only
