Wednesday 27 March 2019

Hadoop : Part - 3


Checkpoint node

The Checkpoint node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. It creates checkpoints of the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally.

Backup node

It maintains an up-to-date, in-memory copy of the file system namespace that is always in sync with the active NameNode. Unlike the Checkpoint node, it does not need to download the fsimage and edits files, because it receives a live stream of edits from the NameNode.

Overwriting replication factor in HDFS

The replication factor in HDFS can be modified or overwritten in 2 ways-
  • Using the Hadoop FS shell, the replication factor can be changed on a per-file basis with the command below (test_file is the file whose replication factor will be set to 2):
$ hadoop fs -setrep -w 2 /my/test_file
  • Using the Hadoop FS shell, the replication factor of all files under a given directory can be changed with the command below (test_dir is the directory; every file under it will have its replication factor set to 5):
$ hadoop fs -setrep -w 5 /my/test_dir
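
To verify the change, a minimal sketch (the file path and the getconf check are illustrative, not part of the original notes):
# The replication factor appears in the second column of the listing
$ hadoop fs -ls /my/test_file
# The cluster-wide default comes from the dfs.replication property
$ hdfs getconf -confKey dfs.replication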

Edge nodes

Edge nodes or gateway nodes are the interface between the Hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications.

InputFormats in Hadoop

  • TextInputFormat (the default)
  • KeyValueTextInputFormat
  • SequenceFileInputFormat

Rack

A rack is a collection of around 40-50 machines. All these machines are connected to the same network switch, and if that switch goes down, all the machines in the rack are out of service; we then say the rack is down.

Rack awareness

The physical location of the data nodes is referred to as a rack in HDFS. The rack ID of each data node is known to the NameNode. The process of choosing closer data nodes based on this rack information is known as Rack Awareness.

Replica Placement Policy

The contents of a file are divided into data blocks. After consulting the NameNode, the client is allocated 3 data nodes for each data block. For each data block, two copies are kept in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.
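
To see how the replicas of a file were actually placed, a hedged sketch (the path is illustrative):
# Lists each block of the file along with the DataNodes and racks holding its replicas
$ hdfs fsck /my/test_file -files -blocks -locations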


In the middle of difficulty lies opportunity.. 

Sunday 24 March 2019

Google Cloud Platform(GCP) : Part- 2

Multi-layered security approach

Google also designs custom chips, including a hardware security chip called Titan that's currently being deployed on both servers and peripherals. Google server machines use cryptographic signatures to make sure they are booting the correct software. Google designs and builds its own data centers, which incorporate multiple layers of physical security protections.
Google's infrastructure provides cryptographic privacy and integrity for remote procedure call (RPC) data on the network, which is how Google services communicate with each other. The infrastructure automatically encrypts RPC traffic in transit between data centers. Google's central identity service, which usually manifests to end users as the Google login page, goes beyond asking for a simple username and password. It also intelligently challenges users for additional information based on risk factors, such as whether they have logged in from the same device or a similar location in the past. Users can also use second factors when signing in, including devices based on the Universal Second Factor (U2F) open standard.
Google services that want to make themselves available on the Internet register themselves with an infrastructure service called the Google Front End (GFE), which checks incoming network connections for correct certificates and best practices. The GFE also applies protections against denial of service attacks. The scale of its infrastructure enables Google to simply absorb many denial of service attacks, even behind the GFEs. Google also has multi-tier, multi-layer denial of service protections that further reduce the risk of any denial of service impact. Inside Google's infrastructure, machine intelligence and rules warn of possible incidents. Google conducts Red Team exercises, simulated attacks, to improve the effectiveness of its responses.
The principle of Least Privilege says that each user should have only those privileges needed to do their jobs. In a least-privilege environment, people are protected from an entire class of errors.
GCP customers use IAM (Identity and Access Management) to implement least privilege, and it makes everybody happier. There are four ways to interact with GCP's management layer:

  • Web-based console
  • Cloud SDK and command-line tools
  • APIs
  • Mobile app
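
As a minimal sketch of the command-line route (the project ID below is a hypothetical example):
# List the projects visible to the authenticated account
$ gcloud projects list
# Point the SDK at a particular project for subsequent commands
$ gcloud config set project my-sample-project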

GCP Resource Hierarchy

All the resources we use, whether they're virtual machines, Cloud Storage buckets, tables in BigQuery, or anything else in GCP, are organized into projects. Optionally, these projects may be organized into folders. Folders can contain other folders. All the folders and projects used by our organization can be brought together under an organization node. Projects, folders, and organization nodes are all places where policies can be defined.
All Google Cloud Platform resources belong to a project. Projects are the basis for enabling and using GCP services, like managing APIs, enabling billing, adding and removing collaborators, and enabling other Google services. Each project is a separate compartment, and each resource belongs to exactly one. Projects can have different owners and users; they're billed separately and they're managed separately. Each GCP project has a name and a project ID that we assign. The project ID is a permanent, unchangeable identifier and it has to be unique across GCP. We use project IDs in several contexts to tell GCP which project we want to work with. On the other hand, project names are for our convenience and we can assign them. GCP also assigns each of our projects a unique project number.
Folders let teams delegate administrative rights, so they can work independently. The resources in a folder inherit IAM policies from the folder. The organization node is the top of the resource hierarchy. There are some special roles associated with it.
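
A hedged sketch of working with this hierarchy from the Cloud SDK (the project ID and organization ID are hypothetical):
# Create a project under an organization node
$ gcloud projects create my-sample-project --organization=123456789
# Show the project's name, ID, and automatically assigned project number
$ gcloud projects describe my-sample-project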

Identity and Access Management(IAM)

IAM lets administrators authorize who can take action on specific resources. An IAM policy has:
  • A "who" part
  • A "can do what" part
  • An "on which resource" part
The "who" part names the user or users. The "who" part of an IAM policy can be defined either by a Google account, a Google group, a service account, or an entire G Suite or Cloud Identity domain. The "can do what" part is defined by an IAM role. An IAM role is a collection of permissions.
There are three kinds of roles in Cloud IAM. Primitive roles can be applied to a GCP project, and they affect all resources in that project. These are the owner, editor, and viewer roles. A viewer can examine a given resource but not change its state. If you're an editor, you can do everything a viewer can do, plus change its state. An owner can do everything an editor can do, plus manage roles and permissions on the resource. The owner role can also set up billing. Often, companies want someone to be able to control the billing for a project without the right to change the resources in the project, and that's why we can grant someone the billing administrator role.
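
A hedged sketch of granting a primitive role from the command line (the project ID and user are hypothetical):
# Grant the viewer role on a project to a user
$ gcloud projects add-iam-policy-binding my-sample-project \
    --member=user:alice@example.com --role=roles/viewer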

IAM Roles

Compute Engine's InstanceAdmin role lets whoever has that role perform a certain set of actions on virtual machines. The actions are listing them, reading and changing their configurations, and starting and stopping them. We must manage permissions for custom roles ourselves. Some companies decide they'd rather stick with the predefined roles. Custom roles can only be used at the project or organization levels. They can't be used at the folder level. Service accounts are named with an email address, but instead of passwords, they use cryptographic keys to access resources.
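
A hedged sketch of creating a service account and giving it a role (all names are hypothetical):
# Create a service account; it is identified by an email address rather than a password
$ gcloud iam service-accounts create demo-sa --display-name="Demo service account"
# Allow the service account to view resources in a project
$ gcloud projects add-iam-policy-binding my-sample-project \
    --member=serviceAccount:demo-sa@my-sample-project.iam.gserviceaccount.com \
    --role=roles/viewer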


Be that one you always wanted to be.. 

Saturday 23 March 2019

Hadoop : Part - 2


When to use Hadoop


  • Support for multiple frameworks: Hadoop can be integrated with multiple analytical tools, like R and Python for analytics and visualisation, Python and Spark for real-time processing, MongoDB and HBase for NoSQL databases, Pentaho for BI, etc.
  • Data size and Data diversity
  • Lifetime data availability due to scalability and fault tolerance.

Hadoop Namenode failover process

In a High Availability cluster, two separate machines are configured as NameNodes. One of the NameNodes is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. 

In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. The Standby node must also have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state.
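
A hedged sketch of checking and driving the failover by hand (the NameNode IDs nn1 and nn2 are illustrative):
# Check which NameNode is currently active
$ hdfs haadmin -getServiceState nn1
# Initiate a manual failover from nn1 to nn2
$ hdfs haadmin -failover nn1 nn2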

Ways to rebalance the cluster when new datanodes are added


  • Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
  • Another way, with no interruption of service, is to turn up the replication of the files, wait for the transfers to stabilize, and then turn the replication back down.
  • Turn off the data-node that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes.
  • Execute the bin/start-balancer.sh command to run a balancing process that moves blocks around the cluster automatically.
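
A hedged example of the balancer invocation (the threshold value is illustrative):
# Rebalance until no DataNode's utilization deviates from the cluster average by more than 5%
$ bin/start-balancer.sh -threshold 5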

Actual data storage locations for NameNode and DataNode

A list of comma-separated pathnames can be specified as dfs.datanode.data.dir to tell each DataNode where to store its block data. The dfs.namenode.name.dir parameter specifies the directories where the NameNode stores the namespace image and edit log.

Limiting DataNode's disk usage

The dfs.datanode.du.reserved property in $HADOOP_HOME/conf/hdfs-site.xml reserves disk space (in bytes) for non-HDFS use, and can therefore be used to limit how much disk a DataNode consumes.
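
A hedged way to read the effective values of these properties on a running cluster:
$ hdfs getconf -confKey dfs.namenode.name.dir
$ hdfs getconf -confKey dfs.datanode.data.dir
$ hdfs getconf -confKey dfs.datanode.du.reserved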

Removing datanodes from a cluster

Removing one or two data-nodes will not lead to any data loss, because the name-node will re-replicate their blocks as soon as it detects that the nodes are dead.
Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included in the exclude file, and the exclude file name should be specified via the dfs.hosts.exclude configuration parameter. Specify the full hostname, IP, or ip:port format in this file. Then the shell command
bin/hadoop dfsadmin -refreshNodes
should be called, which forces the name-node to re-read the exclude file and start the decommission process.
The decommission progress can be monitored on the name-node Web UI. Until all its blocks are replicated, the node will be in the "Decommission In Progress" state. When decommissioning is done, the state will change to "Decommissioned".
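
A hedged end-to-end sketch (the hostname and exclude-file path are hypothetical; dfs.hosts.exclude must already point at the exclude file):
# Add the node to be retired to the exclude file
$ echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude
# Force the name-node to re-read the exclude file and start decommissioning
$ bin/hadoop dfsadmin -refreshNodes
# Watch the node move from "Decommission In Progress" to "Decommissioned"
$ bin/hadoop dfsadmin -report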

Files and block sizes

HDFS provides an API to specify the block size when creating a file, hence multiple files can have different block sizes:
FileSystem.create(path, overwrite, bufferSize, replication, blockSize, progress)
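
The same can also be done from the shell through a per-command configuration override; a hedged sketch (the paths and the 64 MB value are illustrative):
# Upload a file with a 64 MB block size instead of the cluster default
$ hadoop fs -D dfs.blocksize=67108864 -put localfile.txt /my/test_dir/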

Hadoop streaming

Hadoop has a generic API for writing MapReduce programs in any desired programming language, like Python, Ruby, Perl etc., as long as the mapper and reducer read from standard input and write to standard output. This is called Hadoop Streaming.
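
A hedged sketch of a streaming job (the jar location, paths, and scripts are illustrative):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py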

Inter cluster data copy

Hadoop provides the distcp (distributed copy) command to copy data across different Hadoop clusters.
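
A hedged example (the cluster hostnames and paths are hypothetical):
# Copy a directory from one cluster's HDFS to another's
$ hadoop distcp hdfs://namenode1:8020/source/dir hdfs://namenode2:8020/target/dir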


Be the best thing that ever happened to everyone 

Saturday 16 March 2019

Hadoop : Part - 1


History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug was working at Yahoo at that time and is now Chief Architect of Cloudera. Hadoop was named after his son's toy elephant.
Hadoop
Apache Hadoop is a framework that provides various tools to store and process Big Data. It helps in analyzing Big Data and making business decisions. Hadoop is sometimes expanded as High Availability Distributed Object Oriented Platform, but the name is not really an acronym; it comes from the toy elephant mentioned above.
Latest version
The latest version of Hadoop is 3.1.2 released on Feb 6, 2019
Companies using Hadoop
Cloudera, Amazon Web Services, IBM, Hortonworks, Intel, Microsoft etc
Top vendors offering Hadoop distribution
Cloudera, HortonWorks, Amazon Web Services Elastic MapReduce Hadoop Distribution, Microsoft, MapR, IBM etc
Advantages of Hadoop distributions
  • Technical Support
  • Consistent with patches, fixes and bug detection
  • Extra components for monitoring
  • Easy to install 
Modes of Hadoop
Hadoop can run in three modes:
  • Standalone- Default mode of Hadoop. It uses the local file system for input and output operations. It is much faster than the other modes and is mainly used for debugging purposes.
  • Pseudo distributed(Single Node Cluster)- In this case all daemons run on one node, and thus the Master and Slave nodes are the same.
  • Fully distributed(Multiple Node Cluster)- Here separate nodes are allotted as Master and Slaves. The data is distributed across several nodes of the Hadoop cluster.

Main components of Hadoop
There are two main components namely:
  • Storage unit– HDFS
  • Processing framework– YARN

HDFS
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data in a distributed environment. It follows a master-slave architecture.
Components of HDFS
  • NameNode: NameNode is the master node which is responsible for storing the metadata of all the files and directories, such as block locations, replication factors etc. It has information about the blocks that make up a file, and where those blocks are located in the cluster. NameNode uses two files for storing the metadata, namely (see the sketch after this list):
Fsimage- It keeps track of the latest checkpoint of the namespace.
Edit log- It is the log of changes that have been made to the namespace since the last checkpoint.

  • DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.
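
A hedged way to inspect these two files offline (the paths and transaction IDs are illustrative):
# Dump an fsimage checkpoint to XML with the Offline Image Viewer
$ hdfs oiv -p XML -i /data/dfs/name/current/fsimage_0000000000000000042 -o fsimage.xml
# Dump an edit log segment with the Offline Edits Viewer
$ hdfs oev -i /data/dfs/name/current/edits_0000000000000000001-0000000000000000042 -o edits.xml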

YARN
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
Components of YARN
  • ResourceManager: It receives the processing requests and then passes them to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs. It is the central authority that manages resources and schedules applications running on top of YARN.
  • NodeManager: NodeManager is installed on every DataNode and is responsible for the execution of tasks on that node. It runs on the slave machines and is responsible for launching the applications' containers (where applications execute their parts), monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.
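
A hedged way to see both daemons at work from the command line:
# List the NodeManagers registered with the ResourceManager
$ yarn node -list
# List the applications currently known to the ResourceManager
$ yarn application -list
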
Hadoop daemons
Hadoop daemons can be broadly divided into three groups, namely:

  • HDFS daemons- NameNode, DataNode, Secondary NameNode
  • YARN daemons- ResourceManager, NodeManager
  • JobHistoryServer
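
A quick, hedged way to check which of these daemons are running on a node is the JDK's jps tool:
# Lists running Java processes such as NameNode, DataNode, ResourceManager and NodeManager
$ jps
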
Secondary NameNode
It periodically merges the changes (edit log) with the FsImage (file system image) present in the NameNode. It stores the merged FsImage in persistent storage, which can be used in case of failure of the NameNode.

JobHistoryServer
It maintains information about MapReduce jobs after the Application Master terminates.

He Who has a Why to live for, can bear almost any How

Saturday 2 March 2019

Google Cloud Platform(GCP) : Part- 1

Cloud Computing


Cloud computing is a way of using I.T. that has these five important traits:
  • Get computing resources on-demand and self-service.
  • Access these resources over the internet from anywhere we want.
  • The provider of these resources has a big pool of them and allocates them to customers out of that pool.
  • Resources are elastic.
  • Pay only for what we use.

GCP Architectures

Virtualized data centers brought us Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings.
IaaS offerings provide raw compute, storage, and network, organized in ways that are familiar from data centers.
PaaS offerings, on the other hand, bind application code we write to libraries that give access to the infrastructure our application needs.
In the IaaS model, we pay for what we allocate. In the PaaS model, we pay for what we use. Google's popular applications like Search, Gmail, Docs and Drive are Software as a Service applications.

Google Network

It's designed to give its users the highest possible throughput and the lowest possible latencies for their applications. When an Internet user sends traffic to a Google resource, Google responds to the user's request from an edge network location that will provide the lowest latency. Google's edge-caching network places content close to end users to minimize latency.

GCP Regions and Zones

A zone is a deployment area for Google Cloud Platform resources. Zones are grouped into regions, independent geographic areas, and we can choose what regions our GCP resources are in. All the zones within a region have fast network connectivity among them. Locations within regions usually have round-trip network latencies of under five milliseconds. A zone is a single failure domain within a region. As part of building a fault-tolerant application, we can spread the resources across multiple zones in a region. That helps protect against unexpected failures. We can run resources in different regions too. Lots of GCP customers do that, both to bring their applications closer to users around the world, and also to protect against the loss of an entire region, say, due to a natural disaster.
A few Google Cloud Platform services support placing resources in what we call a Multi-Region. For example, Google Cloud Storage lets us place data within the Europe Multi-Region. That means it will be stored redundantly in at least two geographic locations, separated by at least 160 kilometers, within Europe.
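
A hedged sketch of exploring the available locations from the Cloud SDK:
# List the regions and zones the current project can use
$ gcloud compute regions list
$ gcloud compute zones list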

Pricing

Google was the first major cloud provider to bill by the second, rather than rounding up to bigger units of time, for its virtual-machines-as-a-service offering. Google offers per-second billing. Charges for rounding can really add up for customers who are creating and running lots of virtual machines. Compute Engine offers automatically applied sustained-use discounts, which are automatic discounts we get for running a virtual machine for a significant portion of the billing month. When we run an instance for more than 25 percent of a month, Compute Engine automatically gives us a discount for every incremental minute we use it.

Open APIs

Google helps its customers avoid feeling locked in. GCP services are compatible with open source products. For example, Bigtable uses the interface of the open source database Apache HBase, which gives customers the benefit of code portability. Another example, Cloud Dataproc offers the open source big data environment Hadoop as a managed service. Google publishes key elements of technology using open source licenses to create ecosystems that provide customers with options other than Google. For example, TensorFlow is an open source software library for machine learning developed inside Google.
Many GCP technologies provide interoperability. Kubernetes gives customers the ability to mix and match microservices running across different clouds, and Google Stackdriver lets customers monitor workloads across multiple cloud providers.

Why GCP

Google Cloud Platform lets us choose from computing, storage, big data, machine learning and application services for web, mobile, analytics and backend solutions. It's global, it's cost effective, it's open source friendly and it's designed for security. Google Cloud Platform's products and services can be broadly categorized as compute, storage, big data, machine learning, networking, and operations and tools.

We are born not to be Average, 
We are born to be Awesome..