When to use Hadoop
- Support for multiple frameworks: hadoop can be integrated with multiple analytical tools like R and Python for Analytics and visualisation, Python and Spark for real-time processing, MongoDB and HBase for NoSQL database, Pentaho for BI etc
- Data size and Data diversity
- Lifetime data availability due to scalability and fault tolerance.
Hadoop Namenode failover process
In a High Availability cluster, two separate machines are configured as NameNodes. One of the NameNodes is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave.In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace.
In the event of a failover, the Standby will ensure that it has read all of the edits from the JounalNodes before promoting itself to the Active state. Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state.
Ways to rebalance the cluster when new datanodes are added
- Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
- Way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.
- Turn off the data-node, which is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes.Execute the bin/start-balancer.sh command to run a balancing process to move blocks around the cluster automatically.
Actual data storage locations for NameNode and DataNode
A list of comma separated pathnames can be specified as dfs.datanode.data.dir for data storage in datanodes. The dfs.namenode.name.dir parameter is used to specify the namenode directories to store data.
Limiting DataNode's disk usage
The configuration dfs.datanode.du.reserved configuration in $HADOOP_HOME/conf/hdfs-site.xml can be used to limit disk usage.
The configuration dfs.datanode.du.reserved configuration in $HADOOP_HOME/conf/hdfs-site.xml can be used to limit disk usage.
Removing datanodes from a cluster
Removing one or two data-nodes will not lead to any data loss, because name-node will replicate their blocks as long as it will detect that the nodes are dead.Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude. Specify the full hostname, ip or ip:port format in this file. Then the shell command
bin/hadoop dfsadmin -refreshNodes
should be called, which forces the name-node to re-read the exclude file and start the decommission process.
The decommission progress can be monitored on the name-node Web UI. Until all blocks are replicated the node will be in "Decommission In Progress" state. When decommission is done the state will change to "Decommissioned".
Files and block sizes
HDFS provides API to specify block size when creating a file. Hence multiple files can have different block sizes. FileSystem.create(path,overwrite, bufferSize,replication,blockSize,progress)Hadoop streaming
Hadoop has a generic API for writing map reduce programs in any desired programming language like Python, Ruby, Perl etc. This is called Hadoop streaming.Inter cluster data copy
Hadoop provides distCP(distributed copy) command to copy data across different Hadoop clusters.Be the best thing that ever happen to everyone