Sunday 14 April 2019

Hadoop : Part - 5


Security in Hadoop

Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service when using Kerberos; a command-line sketch follows the list.

  • Authentication – The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket-Granting Server (TGS).
  • Service Request – The client uses the service ticket to authenticate itself to the server.
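
On a Kerberized cluster, the first step is typically performed with the standard Kerberos tools before any Hadoop command is run. A minimal sketch, assuming a hypothetical principal hduser@EXAMPLE.COM:
kinit hduser@EXAMPLE.COM     # authenticate to the KDC and obtain a TGT
klist                        # show cached tickets; the TGT should now be listed
hadoop fs -ls /              # Hadoop obtains service tickets transparently using the TGT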

Concurrent writes in HDFS

Multiple clients cannot write to an HDFS file at the same time. HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease on that file. If another client then asks to write to the same file, the NameNode first checks whether it has already granted the lease to someone else. If it has, the second client's write request is rejected.
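
The lease behaviour can be observed from two shells. A rough sketch, using a hypothetical file /tmp/lease-demo.txt that already exists:
hadoop fs -appendToFile - /tmp/lease-demo.txt          # shell 1: holds the write lease while reading from stdin
hadoop fs -appendToFile more.txt /tmp/lease-demo.txt   # shell 2: rejected with a lease error while shell 1 is writing
hdfs debug recoverLease -path /tmp/lease-demo.txt      # force lease recovery, e.g. after a writer crashes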

fsck

fsck is the File System Check. HDFS provides the fsck (filesystem check) command to check for various inconsistencies. It reports problems such as missing blocks for a file or under-replicated blocks. The NameNode automatically corrects most recoverable failures. The check can run on the whole file system or on a subset of files.
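
Typical invocations (the path /user/hduser/data is a hypothetical example):
hdfs fsck /                                             # check the entire file system
hdfs fsck /user/hduser/data -files -blocks -locations   # show per-file block and DataNode detail
hdfs fsck / -list-corruptfileblocks                     # print only the files with corrupt blocks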

Datanode failures

NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks are now under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes; the data never passes through the NameNode.
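
The NameNode's current view of live and dead DataNodes can be checked from the command line:
hdfs dfsadmin -report     # lists live and dead DataNodes with their capacity and block counts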

Taskinstances

Task instances are the actual map and reduce tasks that run on the slave nodes, not the jobs themselves. Each task instance runs in its own JVM process, and multiple task-instance processes can run on a single slave node at the same time.
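
Because every task instance gets its own JVM by default, JVM start-up cost can be tuned. In classic (Hadoop 1) MapReduce, the mapred.job.reuse.jvm.num.tasks property controls how many tasks may share one JVM; the jar and driver class below are hypothetical, and the driver is assumed to use ToolRunner so that -D options are honoured:
hadoop jar myjob.jar MyDriver -D mapred.job.reuse.jvm.num.tasks=5 /input /output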

Communication to HDFS

Clients communicate with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. Client applications then talk directly to a DataNode, once the NameNode has provided the location of the data.
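
The same two-step pattern is visible from the shell: the block locations come from the NameNode, while the bytes stream from the DataNodes. The file path below is hypothetical:
hdfs fsck /user/hduser/data.txt -files -blocks -locations   # metadata: which DataNodes hold each block
hadoop fs -cat /user/hduser/data.txt                        # data: streamed directly from those DataNodes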

HDFS block and Inputsplit

A block is the physical representation of data, while an input split is the logical representation of the data present in the blocks. An input split is what a single mapper processes; by default its size matches the block size, but split boundaries respect record boundaries, so a split may extend slightly into the next block to finish a record.
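
The two can be inspected and controlled independently. A sketch with hypothetical paths and jar names, again assuming a ToolRunner-based driver:
hdfs fsck /user/hduser/big.log -blocks     # the physical blocks as stored by HDFS
hadoop jar myjob.jar MyDriver -D mapreduce.input.fileinputformat.split.maxsize=67108864 /input /output     # cap each logical split at 64 MB for this job only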

Hadoop federation

HDFS Federation enhances the existing HDFS architecture. It uses multiple independent NameNodes/namespaces to scale the name service horizontally, separating the namespace layer from the storage layer: the DataNodes are shared as common storage by all the NameNodes. Hence HDFS Federation provides isolation, scalability and a simple design.
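
With federation, each NameNode serves its own namespace, and a client can address each one explicitly (the host names below are hypothetical):
hadoop fs -ls hdfs://namenode1:8020/marketing     # a namespace served by the first NameNode
hadoop fs -ls hdfs://namenode2:8020/finance       # a separate namespace served by the second NameNode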



Don't Give Up. The beginning is always the hardest but life rewards those who work hard for it.

Saturday 6 April 2019

Hadoop : Part - 4


Speculative Execution

Instead of identifying and fixing slow-running tasks, Hadoop tries to detect when a task is running slower than expected and launches an equivalent task as a backup; whichever copy finishes first is used, and the other is killed. This backup mechanism in Hadoop is called Speculative Execution.
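
Speculative execution is enabled by default and can be toggled per job. A sketch with a hypothetical jar and driver, assuming a ToolRunner-based driver so -D options are honoured:
hadoop jar myjob.jar MyDriver -D mapreduce.map.speculative=false -D mapreduce.reduce.speculative=false /input /output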

Heartbeat in HDFS

A heartbeat is a signal from a DataNode indicating that it is alive. Every DataNode sends a heartbeat to the NameNode at a regular interval (3 seconds by default); if heartbeats stop arriving, the NameNode eventually marks the DataNode as dead.
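
The relevant intervals can be read from the live configuration:
hdfs getconf -confKey dfs.heartbeat.interval                    # heartbeat period in seconds (default 3)
hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval   # recheck period (ms) used when declaring a DataNode dead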

Hadoop archives

Hadoop Archives (HAR) offer an effective way to deal with the small files problem: HAR is an archiving facility that packs files into HDFS blocks efficiently, reducing the pressure that many small files put on the NameNode's memory. Creating an archive runs a MapReduce job; -p gives the parent path of the source files:
hadoop archive -archiveName myhar.har -p /input/location /output/location
Once a .har file is created, you can do a listing on it, and you will see that it is made up of index files and part files. Part files are nothing but the original files concatenated together into a big file. Index files are lookup files used to locate the individual small files inside the big part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-000000
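
To read the archived files transparently, Hadoop exposes the archive through the har:// filesystem scheme:
hadoop fs -ls har:///output/location/myhar.har     # lists the original small files inside the archive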

Reason for setting HDFS blocksize as 128MB

The block size is the smallest unit of data that a file system can store. If the block size is small, a large file needs many blocks, which means more metadata for the NameNode to hold and more lookups to locate the file. HDFS is meant to handle large files, so with a 128MB block size the number of blocks, and hence the number of requests, goes down, greatly reducing the metadata overhead and the load on the NameNode.
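
The configured default can be checked, and overridden per file at write time (the file name below is hypothetical):
hdfs getconf -confKey dfs.blocksize                        # default block size in bytes (134217728 = 128 MB)
hadoop fs -D dfs.blocksize=268435456 -put big.log /data/   # upload one file with a 256 MB block size instead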

Data Locality in Hadoop

Data locality refers to moving the computation close to where the actual data resides on a node, instead of moving large volumes of data to the computation. Hadoop prefers node-local tasks, then rack-local tasks, and only then off-rack tasks. This minimizes network congestion and increases the overall throughput of the system.

Safemode in Hadoop

Safemode in Apache Hadoop is a maintenance state of the NameNode during which the NameNode doesn't allow any modifications to the file system. In Safemode, the HDFS cluster is read-only, and the NameNode doesn't replicate or delete blocks.
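
Safemode can be inspected and controlled with dfsadmin:
hdfs dfsadmin -safemode get      # report whether the NameNode is currently in safemode
hdfs dfsadmin -safemode enter    # put the NameNode into safemode for maintenance
hdfs dfsadmin -safemode leave    # return the NameNode to normal read-write operation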

Single Point of Failure

In Hadoop 1.0, the NameNode is a single point of failure (SPOF): if the NameNode fails, all clients are unable to read or write files.
Hadoop 2.0 overcomes this SPOF with NameNode High Availability, which supports a pair of NameNodes: if the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node.
Some deployments require a higher degree of fault tolerance, so Hadoop 3.0 extends this feature by allowing the user to run multiple standby NameNodes.
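
With HA configured, the state of each NameNode can be checked and a failover triggered manually. The service IDs nn1 and nn2 are hypothetical names taken from dfs.ha.namenodes.<nameservice>:
hdfs haadmin -getServiceState nn1     # prints "active" or "standby"
hdfs haadmin -failover nn1 nn2        # manually fail over from nn1 to nn2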



Strive for excellence and success will follow you.