Security in Hadoop
Apache Hadoop uses Kerberos for security. At a high level, a client takes three steps to access a service when using Kerberos:
- Authentication – The client authenticates itself to the authentication server and receives a timestamped Ticket-Granting Ticket (TGT).
- Authorization – The client uses the TGT to request a service ticket from the Ticket-Granting Server (TGS).
- Service Request – The client uses the service ticket to authenticate itself to the server.
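The three steps above can be sketched as a toy model. This is not real Kerberos: a keyed hash stands in for ticket encryption, the password is passed directly, and `ToyKDC`, `ToyService`, and all keys are hypothetical names invented for illustration. The sketch only mirrors the order of the exchange.

```python
import hashlib
import hmac


def _sign(key: bytes, message: str) -> str:
    """Keyed hash standing in for ticket encryption (toy only)."""
    return hmac.new(key, message.encode(), hashlib.sha256).hexdigest()


class ToyKDC:
    """Hypothetical stand-in for the authentication server plus the TGS."""

    def __init__(self, passwords, service_keys):
        self._passwords = passwords        # principal -> password
        self._service_keys = service_keys  # service name -> shared key
        self._tgs_key = b"toy-tgs-key"     # assumed secret known only to the KDC

    def authenticate(self, principal, password, now):
        # Step 1: authentication, returns a timestamped TGT.
        if self._passwords.get(principal) != password:
            raise PermissionError("unknown principal or wrong password")
        body = f"{principal}|{now}"
        return f"{body}|{_sign(self._tgs_key, body)}"

    def grant_service_ticket(self, tgt, service, now, max_age=3600):
        # Step 2: the TGT is exchanged for a service ticket.
        principal, issued, signature = tgt.split("|")
        expected = _sign(self._tgs_key, f"{principal}|{issued}")
        if not hmac.compare_digest(signature, expected):
            raise PermissionError("TGT was not issued by this KDC")
        if now - int(issued) > max_age:
            raise PermissionError("TGT has expired")
        body = f"{principal}|{service}"
        return f"{body}|{_sign(self._service_keys[service], body)}"


class ToyService:
    """Hypothetical service that shares a key with the KDC."""

    def __init__(self, name, key):
        self.name, self._key = name, key

    def accept(self, ticket):
        # Step 3: the service verifies the ticket and learns who is calling.
        principal, service, signature = ticket.split("|")
        expected = _sign(self._key, f"{principal}|{service}")
        if service != self.name or not hmac.compare_digest(signature, expected):
            raise PermissionError("invalid service ticket")
        return principal


kdc = ToyKDC({"alice": "secret"}, {"hdfs": b"toy-hdfs-key"})
hdfs = ToyService("hdfs", b"toy-hdfs-key")
tgt = kdc.authenticate("alice", "secret", now=1000)
ticket = kdc.grant_service_ticket(tgt, "hdfs", now=1060)
print(hdfs.accept(ticket))  # alice
```

The timestamp inside the TGT is what lets the toy TGS refuse a ticket that is older than `max_age` seconds.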
Concurrent writes in HDFS
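As a rough illustration of the single-writer rule this section covers, the lease check can be modelled with a small in-memory table. `ToyLeaseManager` and its methods are hypothetical names, not part of the Hadoop API, and real HDFS leases also expire and must be renewed, which this sketch leaves out.

```python
class ToyLeaseManager:
    """Hypothetical model of the NameNode's per-file write leases."""

    def __init__(self):
        self._leases = {}  # file path -> client currently holding the lease

    def open_for_write(self, path, client):
        """Grant the lease if the file has no other writer; return success."""
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            return False  # another client already holds the lease
        self._leases[path] = client
        return True

    def close(self, path, client):
        """Release the lease, but only if this client actually holds it."""
        if self._leases.get(path) == client:
            del self._leases[path]


leases = ToyLeaseManager()
print(leases.open_for_write("/logs/app.log", "client-A"))  # True
print(leases.open_for_write("/logs/app.log", "client-B"))  # False
leases.close("/logs/app.log", "client-A")
print(leases.open_for_write("/logs/app.log", "client-B"))  # True
```

Readers are not tracked at all in this model, which matches the idea that many clients may read while at most one writes.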
Multiple clients cannot write to an HDFS file at the same time. HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease. Suppose another client then wants to write to that file and asks the NameNode for the write operation. The NameNode first checks whether it has already granted a lease on that file to someone else. If another client holds the lease, the NameNode rejects the new write request.
fsck
fsck is the File System Check. HDFS uses the fsck command to check for various inconsistencies and to report problems such as missing blocks for a file or under-replicated blocks. fsck itself only reports what it finds; the NameNode automatically corrects most of the recoverable failures. The check can run on the whole file system or on a subset of files.
DataNode failures
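A minimal sketch of the failure handling this section describes: heartbeats record liveness, block reports record which node holds which block, and a missed-heartbeat timeout triggers a re-replication plan. The class name, the method names, and the timeout and replication values are all assumptions made for illustration, not Hadoop's API or defaults.

```python
class ToyNameNode:
    """Hypothetical model of heartbeat tracking and re-replication planning."""

    def __init__(self, timeout, replication):
        self.timeout = timeout          # seconds of silence before a node is dead
        self.replication = replication  # desired number of replicas per block
        self.last_seen = {}             # datanode -> time of last heartbeat
        self.block_map = {}             # block id -> set of datanodes holding it

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def block_report(self, node, blocks):
        for block in blocks:
            self.block_map.setdefault(block, set()).add(node)

    def dead_nodes(self, now):
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

    def replication_plan(self, now):
        """Return (block, source, target) copies; data moves node to node."""
        dead = self.dead_nodes(now)
        live = set(self.last_seen) - dead
        plan = []
        for block, holders in sorted(self.block_map.items()):
            alive = holders - dead
            missing = self.replication - len(alive)
            if missing <= 0 or not alive:
                continue  # fully replicated, or no surviving copy to read from
            source = min(alive)
            for target in sorted(live - alive)[:missing]:
                plan.append((block, source, target))
        return plan


nn = ToyNameNode(timeout=30, replication=2)
for node in ("dn1", "dn2", "dn3"):
    nn.heartbeat(node, now=0)
nn.block_report("dn1", ["blk_1"])
nn.block_report("dn2", ["blk_1"])
nn.heartbeat("dn2", now=100)
nn.heartbeat("dn3", now=100)       # dn1 has gone silent
print(nn.dead_nodes(now=100))      # {'dn1'}
print(nn.replication_plan(now=100))  # [('blk_1', 'dn2', 'dn3')]
```

The plan only names a source and a target DataNode; the model never routes block data through the NameNode object itself.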
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, it marks that DataNode as dead. Since the dead node's blocks are now under-replicated, the system begins replicating the blocks that were stored on it. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
Task instances
Task instances are the actual MapReduce tasks that run on each slave node. Each task instance runs in its own JVM process. Multiple task instance processes can run on a slave node at the same time.
Communication to HDFS
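A rough sketch of the read path this section covers: the client asks the NameNode only for block locations, then fetches the bytes from DataNodes directly. The classes and the `get_block_locations` and `read_block` methods are hypothetical stand-ins, not the real Hadoop HDFS API.

```python
class ToyDataNode:
    """Hypothetical DataNode: stores block bytes by block id."""

    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    """Hypothetical NameNode: holds metadata only, never file contents."""

    def __init__(self):
        self.files = {}  # path -> ordered list of (block id, [datanode names])

    def get_block_locations(self, path):
        if path not in self.files:
            raise FileNotFoundError(path)
        return self.files[path]


def read_file(path, namenode, datanodes):
    """Locate blocks via the NameNode, then read each from a DataNode."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].read_block(block_id)  # direct transfer
    return data


dn1, dn2 = ToyDataNode(), ToyDataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn = ToyNameNode()
nn.files["/greeting.txt"] = [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]
print(read_file("/greeting.txt", nn, {"dn1": dn1, "dn2": dn2}))  # b'hello hdfs'
```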
Clients communicate with HDFS through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. Once the NameNode has provided the location of the data, client applications can talk directly to a DataNode.
HDFS block and InputSplit
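The difference between a physical block and a logical split can be seen on a tiny example. The sketch below cuts a file into fixed-size blocks by byte offset, which may cut a record in half, and then assigns whole records to splits. The rule used here (a record belongs to the split that contains its first byte) and the 8-byte size are simplifying assumptions for illustration, not Hadoop's exact reader logic.

```python
def physical_blocks(data: bytes, block_size: int):
    """Cut the file at fixed byte offsets, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def record_ranges(data: bytes):
    """Byte range (start, end) of every newline-terminated record."""
    ranges, pos = [], 0
    while pos < len(data):
        newline = data.find(b"\n", pos)
        end = len(data) if newline == -1 else newline + 1
        ranges.append((pos, end))
        pos = end
    return ranges


def records_for_split(data: bytes, start: int, length: int):
    """Whole records whose first byte falls inside [start, start + length)."""
    stop = start + length
    return [data[s:e] for s, e in record_ranges(data) if start <= s < stop]


data = b"alpha\nbravo\ncharlie\n"
print(physical_blocks(data, 8))
# [b'alpha\nbr', b'avo\nchar', b'lie\n']
print(records_for_split(data, 0, 8))
# [b'alpha\n', b'bravo\n']
print(records_for_split(data, 8, 8))
# [b'charlie\n']
```

The blocks split "bravo" and "charlie" mid-record, while each split still yields only complete records.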
A block is the physical representation of data, while a split is the logical representation of the data present in the blocks.
Hadoop federation
HDFS Federation enhances the existing HDFS architecture. It uses many independent NameNodes, each managing its own namespace, to scale the name service horizontally, and it separates the namespace layer from the storage layer. As a result, HDFS Federation provides isolation, scalability, and a simple design.
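The idea of several independent NameNodes, each owning part of the namespace, can be sketched as a lookup from path prefix to NameNode. The mount table, the NameNode names, and the longest-prefix rule below are assumptions for illustration, similar in spirit to a client-side mount table; they are not taken from Hadoop's configuration format.

```python
def resolve_namenode(path: str, mount_table: dict) -> str:
    """Pick the NameNode whose mount point is the longest prefix of path."""
    matches = [
        mount for mount in mount_table
        if path == mount or path.startswith(mount.rstrip("/") + "/")
    ]
    if not matches:
        raise LookupError(f"no namespace is mounted for {path}")
    return mount_table[max(matches, key=len)]


# Hypothetical layout: three independent NameNodes, one namespace each.
mount_table = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/data/logs": "namenode-3",
}

print(resolve_namenode("/user/alice/report.csv", mount_table))  # namenode-1
print(resolve_namenode("/data/raw/part-0", mount_table))        # namenode-2
print(resolve_namenode("/data/logs/app.log", mount_table))      # namenode-3
```

Because each NameNode in this model answers only for its own prefix, adding a namespace means adding one table entry rather than growing a single NameNode.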