History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who worked at Yahoo during Hadoop's early development and later became Chief Architect of Cloudera, named the project after his son's toy elephant.
Hadoop
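Apache Hadoop is an open-source framework that provides tools to store and process Big Data across clusters of machines. It helps in analyzing Big Data and making business decisions. The name is not an acronym: it comes from the toy elephant mentioned above, and expansions such as "High Availability Distributed Object Oriented Platform" were invented after the fact.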
Latest version
At the time of writing, the latest version of Hadoop is 3.1.2, released on February 6, 2019.
Companies using Hadoop
Cloudera, Amazon Web Services, IBM, Hortonworks, Intel, Microsoft, etc.
Top vendors offering Hadoop distribution
Cloudera, Hortonworks, Amazon Web Services (Elastic MapReduce), Microsoft, MapR, IBM, etc.
Advantages of Hadoop distributions
- Technical Support
- Consistent delivery of patches, fixes and bug detection
- Extra components for monitoring
- Easy to install
Modes of Hadoop
Hadoop can run in three modes:
- Standalone: The default mode of Hadoop. It uses the local file system for input and output operations. Because no daemons run and everything executes in a single Java process, it is faster than the other modes and is mainly used for debugging.
- Pseudo-distributed (single-node cluster): All daemons run on one node, each in its own Java process, so the same machine acts as both master and slave.
- Fully distributed (multi-node cluster): Separate nodes are allotted as master and slaves, and the data is distributed across several nodes of the Hadoop cluster.
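As a rough illustration of how the three modes differ in practice, the sketch below guesses the mode from the file system URI a setup uses. `fs.defaultFS` is a real Hadoop property whose default value is `file:///`; the function `guess_mode`, the example host names and the rule it applies are assumptions made for this example and are not part of Hadoop.

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def guess_mode(fs_default_fs="file:///"):
    """Heuristically classify a Hadoop setup from its fs.defaultFS value.

    Hypothetical helper for illustration only; Hadoop has no such function.
    """
    parsed = urlparse(fs_default_fs)
    if parsed.scheme == "file":
        # Local file system, no HDFS daemons involved: standalone mode.
        return "standalone"
    if parsed.hostname in LOCAL_HOSTS:
        # HDFS is in use, but the NameNode runs on this same machine.
        return "pseudo-distributed"
    # HDFS with the NameNode on a different host.
    return "fully distributed"


print(guess_mode())
print(guess_mode("hdfs://localhost:9000"))
print(guess_mode("hdfs://namenode.example.com:8020"))
```

This is only a heuristic: a real cluster's mode also depends on which daemons are started and on which machines.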
Main components of Hadoop
Hadoop has two main components:
- Storage unit: HDFS
- Processing framework: YARN
HDFS
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It stores different kinds of data in a distributed environment by splitting files into blocks and spreading them across the cluster. It follows a master-slave architecture.
Components of HDFS
- NameNode: The NameNode is the master node. It stores the metadata of all files and directories, such as replication factors and the list of blocks that make up each file, and it tracks where those blocks are located in the cluster. The NameNode uses two files to store this metadata:
  - FsImage: A snapshot of the namespace as of the latest checkpoint.
  - Edit log: The log of changes made to the namespace since that checkpoint.
- DataNode: DataNodes are the slave nodes, which store the actual data blocks in HDFS. The NameNode manages all the DataNodes.
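The division of labour between NameNode and DataNodes can be sketched with a toy model: a file is cut into fixed-size blocks, each block is copied to several DataNodes, and the resulting block-to-location map is the kind of metadata the NameNode keeps. The function `place_blocks`, the node names and the round-robin placement are assumptions for illustration; real HDFS uses rack-aware placement and much larger blocks (128 MB by default, with a default replication factor of 3).

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Return {block_id: [datanode, ...]}, a toy version of NameNode metadata.

    Simple round-robin placement, not the rack-aware policy of real HDFS.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    block_map = {}
    for block_id in range(num_blocks):
        start = block_id % len(datanodes)
        # Pick `replication` distinct nodes, wrapping around the node list.
        block_map[block_id] = [
            datanodes[(start + i) % len(datanodes)] for i in range(replication)
        ]
    return block_map


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
block_map = place_blocks(file_size=300, block_size=128, datanodes=nodes)
for block_id, replicas in block_map.items():
    print(block_id, replicas)
```

With these numbers the file needs three blocks, and every block ends up on three different DataNodes, so losing any single node leaves at least two copies of each block.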
YARN
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. It manages cluster resources and provides an execution environment for processing frameworks such as MapReduce.
Components of YARN
- ResourceManager: The central authority that manages cluster resources and schedules the applications running on top of YARN. It receives processing requests, allocates resources to applications according to their needs, and passes the requests to the corresponding NodeManagers, where the actual processing takes place.
- NodeManager: Runs on every slave machine (every DataNode) and is responsible for executing tasks there. It launches the application’s containers (where applications execute their part), monitors their resource usage (CPU, memory, disk, network) and reports it to the ResourceManager.
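The interaction between the two YARN components can be sketched as a toy allocator: each NodeManager tracks its free memory, and the ResourceManager places a container on a node that can hold it. The classes below only borrow the daemon names; their methods, the node names and the "most free memory" rule are assumptions for illustration and do not reflect YARN's actual schedulers or API.

```python
class NodeManager:
    """Toy stand-in for a YARN NodeManager tracking memory only."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []  # (app_id, memory_mb) pairs running here

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Toy stand-in for a YARN ResourceManager."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        """Place one container on the node with the most free memory.

        Returns the chosen node's name, or None if no node has room.
        """
        candidates = [nm for nm in self.node_managers if nm.free_mb >= memory_mb]
        if not candidates:
            return None
        target = max(candidates, key=lambda nm: nm.free_mb)
        target.launch(app_id, memory_mb)
        return target.name


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
first = rm.allocate("app1", 3072)   # fits only on nm1
second = rm.allocate("app2", 2048)  # nm1 has 1024 MB left, so nm2 is used
third = rm.allocate("app3", 2048)   # no node has 2048 MB free any more
print(first, second, third)
```

A real scheduler would queue the third request rather than reject it, and would also account for CPU and other resources.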
Hadoop daemons
Hadoop daemons can be broadly divided into three groups:
- HDFS daemons: NameNode, DataNode, Secondary NameNode
- YARN daemons: ResourceManager, NodeManager
- JobHistoryServer
Secondary NameNode
The Secondary NameNode periodically merges the changes recorded in the edit log with the FsImage (Filesystem Image) of the NameNode. It stores the merged FsImage in persistent storage, where it can be used if the NameNode fails. Despite its name, it is a checkpointing helper and not a hot standby for the NameNode.
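The checkpoint step can be sketched as replaying the logged changes on top of the last snapshot. In the toy model below the FsImage is a plain dictionary and the edit log is a list of tuples; the operation names (`create`, `set_replication`, `delete`) and both functions are assumptions for illustration, since the real FsImage and edit log are binary files with many more operation types.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (path -> metadata)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0]}
    elif op == "set_replication":
        namespace[path]["replication"] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the FsImage.

    Returns the new FsImage and a fresh, empty edit log.
    """
    merged = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


fsimage = {"/data/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/data/b.txt", 3),
    ("set_replication", "/data/a.txt", 2),
    ("delete", "/data/b.txt"),
]
new_fsimage, new_log = checkpoint(fsimage, edit_log)
print(new_fsimage, new_log)
```

Because the merge works on a copy, the old FsImage stays intact until the new one is complete, and the edit log that must be replayed after a restart stays short.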
JobHistoryServer
The JobHistoryServer maintains information about MapReduce jobs after their ApplicationMaster terminates.