Saturday, 6 April 2019

Hadoop : Part - 4


Speculative Execution

Rather than diagnosing and fixing slow-running tasks, Hadoop tries to detect when a task is running slower than expected and launches an equivalent task as a backup. Whichever copy finishes first is used and the other is killed. This backup mechanism is called Speculative Execution.
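As a rough illustration of the idea (this is not Hadoop's actual scheduler logic), the sketch below flags tasks whose progress rate falls well below the average of their peers. The function name, the threshold parameter and the task ids are assumptions made for this example.

```python
def find_speculation_candidates(progress_rates, threshold=0.5):
    """Return ids of tasks whose progress rate is below threshold * mean rate.

    progress_rates: dict mapping task id -> progress per second
    (a hypothetical metric chosen for this sketch).
    """
    if not progress_rates:
        return []
    mean_rate = sum(progress_rates.values()) / len(progress_rates)
    return sorted(
        task for task, rate in progress_rates.items()
        if rate < threshold * mean_rate
    )


rates = {"task-1": 0.10, "task-2": 0.09, "task-3": 0.02}
# task-3 is far slower than its peers, so it would get a backup attempt.
print(find_speculation_candidates(rates))
```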

Heartbeat in HDFS

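A heartbeat is a periodic signal that a node sends to indicate that it is alive. In HDFS, each DataNode sends heartbeats to the NameNode; if the NameNode stops receiving them, it treats that DataNode as dead.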
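A minimal sketch of how a receiver could track heartbeats is shown below. The class, the node names and the 30-second timeout are hypothetical values for illustration; they are not HDFS defaults.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.record_heartbeat("datanode-1", now=0)
monitor.record_heartbeat("datanode-2", now=0)
monitor.record_heartbeat("datanode-1", now=25)
# At time 40, datanode-2 has been silent for longer than the timeout.
print(monitor.dead_nodes(now=40))
```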

Hadoop archives

Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks efficiently, which makes it an effective way to tackle the small files problem in Hadoop.
hadoop archive -archiveName myhar.har -p /input/location /output/location
Once a .har file is created, you can list its contents and see that it is made up of index files and part files. Part files are simply the original files concatenated into larger files. Index files are lookup files used to locate the individual small files inside the part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-000000

Reason for setting HDFS blocksize as 128MB

The block size is the smallest unit of data that a file system can store. HDFS is meant to handle large files, and with a small block size a large file is split into many blocks, so the NameNode has to keep metadata for, and answer lookups about, many more blocks. With a block size of 128 MB the number of blocks and requests goes down, greatly reducing overhead and the load on the NameNode.
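The effect is easy to see with a little arithmetic. The sketch below counts the blocks needed for a 1 GB file at two block sizes; the 4 KB figure is only a comparison value typical of local file systems, not an HDFS setting.

```python
def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store a file (the last block may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

small_blocks = block_count(1 * GB, 4 * KB)
large_blocks = block_count(1 * GB, 128 * MB)
print(small_blocks)  # blocks to track with a 4 KB block size
print(large_blocks)  # blocks to track with a 128 MB block size
```

Every block is an entry the NameNode has to track, so the larger block size cuts the bookkeeping for this file by several orders of magnitude.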

Data Locality in Hadoop

Data locality refers to moving the computation close to the node where the data actually resides, instead of moving large amounts of data to the computation. This minimizes network congestion and increases the overall throughput of the system.
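A toy scheduler makes the preference concrete. This is only a sketch: the function, the node names and the two locality labels are assumptions for the example, and a real scheduler would consider more factors.

```python
def pick_node(block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise take any free node.

    block_locations: set of nodes holding a replica of the input block.
    free_nodes: non-empty list of nodes that can run a task right now.
    """
    for node in free_nodes:
        if node in block_locations:
            return node, "data-local"
    return free_nodes[0], "remote"


replicas = {"node-b", "node-d"}
print(pick_node(replicas, ["node-a", "node-b", "node-c"]))
print(pick_node(replicas, ["node-a", "node-c"]))
```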

Safemode in Hadoop

Safemode in Apache Hadoop is a maintenance state of the NameNode during which it does not allow any modifications to the file system. During Safemode, the HDFS cluster is read-only and does not replicate or delete blocks.

Single Point of Failure

In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, clients are unable to read or write files.
Hadoop 2.0 overcomes this SPOF by supporting multiple NameNodes: if the active NameNode fails, a standby NameNode takes over the responsibilities of the active node.
Some deployments require a higher degree of fault tolerance, so Hadoop 3.0 extends this feature by allowing the user to run multiple standby NameNodes.



Strive for excellence and success will follow you.. 


Wednesday, 27 March 2019

Hadoop : Part - 3


Checkpoint node

The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. It creates checkpoints of the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally.

Backup node

It maintains an up-to-date in-memory copy of the file system namespace that is kept in sync with the active NameNode.

Overwriting replication factor in HDFS

The replication factor in HDFS can be modified or overwritten in two ways:
  • Using the Hadoop FS shell, the replication factor can be changed per file with the command below: $ hadoop fs -setrep -w 2 /my/test_file (test_file is the file whose replication factor will be set to 2)
  • Using the Hadoop FS shell, the replication factor of all files under a given directory can be changed with the command below: $ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the directory; all files in it will have their replication factor set to 5)

Edge nodes

Edge nodes, or gateway nodes, are the interface between the Hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications.

InputFormats in Hadoop

  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat

Rack

A rack is a collection of around 40-50 machines connected to the same network switch. If that switch goes down, all machines in the rack are out of service, and we say that the rack is down.

Rack awareness

In HDFS, the rack identifies the physical location of a DataNode. The NameNode acquires the rack id of each DataNode. Selecting closer DataNodes based on this rack information is known as Rack Awareness.
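One way to reason about "closer" is to treat the cluster as a tree and count the hops between two nodes. The sketch below assumes locations are written as paths such as /rack-1/node-a; the path format and the node names are illustrative.

```python
def distance(location_a, location_b):
    """Hops between two nodes, given tree paths like '/rack-1/node-a'."""
    parts_a = location_a.strip("/").split("/")
    parts_b = location_b.strip("/").split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    # Each node contributes the number of steps up to the closest common ancestor.
    return (len(parts_a) - common) + (len(parts_b) - common)


print(distance("/rack-1/node-a", "/rack-1/node-a"))  # same node
print(distance("/rack-1/node-a", "/rack-1/node-b"))  # same rack
print(distance("/rack-1/node-a", "/rack-2/node-c"))  # different racks
```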

Replica Placement Policy

The contents of a file are divided into data blocks. After consulting the NameNode, the client is allocated three DataNodes for each data block. For each data block, two copies are stored on one rack and the third copy on another rack. This is generally referred to as the Replica Placement Policy.
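The sketch below picks three DataNodes so that two replicas share one rack and the third sits on a different rack. It assumes the writer runs on a DataNode and keeps the first replica there; the topology, node names and deterministic selection are simplifications for illustration, not the NameNode's real algorithm.

```python
def place_replicas(topology, writer_node):
    """Pick three DataNodes: the writer's node, then two nodes on one other rack.

    topology: dict mapping rack id -> list of node names (hypothetical layout).
    Raises StopIteration if no other rack has at least two nodes.
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_rack = next(
        r for r in sorted(topology)
        if r != local_rack and len(topology[r]) >= 2
    )
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]


topology = {
    "rack-1": ["node-a", "node-b"],
    "rack-2": ["node-c", "node-d"],
}
print(place_replicas(topology, "node-a"))
```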


In the middle of difficulty lies opportunity.. 

Sunday, 24 March 2019

Google Cloud Platform(GCP) : Part- 2

Multi layered security approach

Google also designs custom chips, including a hardware security chip called Titan that is currently being deployed on both servers and peripherals. Google server machines use cryptographic signatures to make sure they are booting the correct software. Google designs and builds its own data centers, which incorporate multiple layers of physical security protections.
Google's infrastructure provides cryptographic privacy and integrity for remote procedure call (RPC) data on the network, which is how Google services communicate with each other. The infrastructure automatically encrypts RPC traffic in transit between data centers. Google's central identity service, which usually manifests to end users as the Google login page, goes beyond asking for a simple username and password. It also intelligently challenges users for additional information based on risk factors, such as whether they have logged in from the same device or a similar location in the past. Users can also use second factors when signing in, including devices based on the Universal 2nd Factor (U2F) open standard.
Google services that want to make themselves available on the Internet register themselves with an infrastructure service called the Google Front End (GFE), which checks incoming network connections for correct certificates and best practices. The GFE additionally applies protections against denial-of-service attacks. The scale of its infrastructure enables Google to simply absorb many denial-of-service attacks. Behind the GFEs, Google also has multi-tier, multi-layer denial-of-service protections that further reduce the risk of any denial-of-service impact. Inside Google's infrastructure, machine intelligence and rules warn of possible incidents. Google conducts Red Team exercises (simulated attacks) to improve the effectiveness of its responses.
The principle of least privilege says that each user should have only those privileges needed to do their job. In a least-privilege environment, people are protected from an entire class of errors.
GCP customers use IAM (Identity and Access Management) to implement least privilege, and it makes everybody happier. There are four ways to interact with GCP's management layer:

  • Web-based console
  • SDK and command-line tools
  • APIs
  • Mobile  app

GCP Resource Hierarchy

All the resources we use, whether they are virtual machines, Cloud Storage buckets, tables in BigQuery or anything else in GCP, are organized into projects. Optionally, these projects may be organized into folders. Folders can contain other folders. All the folders and projects used by our organization can be brought together under an organization node. Projects, folders and organization nodes are all places where policies can be defined.
All Google Cloud Platform resources belong to a project. Projects are the basis for enabling and using GCP services, such as managing APIs, enabling billing, adding and removing collaborators and enabling other Google services. Each project is a separate compartment and each resource belongs to exactly one. Projects can have different owners and users; they are billed separately and they are managed separately. Each GCP project has a name and a project ID that we assign. The project ID is a permanent, unchangeable identifier and it has to be unique across GCP. We use project IDs in several contexts to tell GCP which project we want to work with. Project names, on the other hand, are for our convenience and we can change them. GCP also assigns each of our projects a unique project number.
Folders let teams delegate administrative rights so they can work independently. The resources in a folder inherit IAM policies from the folder. The organization node is the top of the resource hierarchy, and there are some special roles associated with it.
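Policy inheritance can be pictured as a walk up the hierarchy. The sketch below is a toy model, not the IAM API: the class, the member address and the role names are hypothetical, and it assumes the effective policy on a resource is the union of what is granted on it and on its ancestors.

```python
class ResourceNode:
    """A node in a resource hierarchy: organization, folder or project."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.bindings = {}  # member -> set of roles granted directly on this node

    def grant(self, member, role):
        self.bindings.setdefault(member, set()).add(role)

    def effective_roles(self, member):
        """Union of roles granted on this node and on all of its ancestors."""
        roles = set(self.bindings.get(member, set()))
        if self.parent is not None:
            roles |= self.parent.effective_roles(member)
        return roles


org = ResourceNode("example-org")
folder = ResourceNode("team-folder", parent=org)
project = ResourceNode("demo-project", parent=folder)

org.grant("alice@example.com", "viewer")
folder.grant("alice@example.com", "editor")
# The project has no direct grants, yet it inherits both roles from above.
print(sorted(project.effective_roles("alice@example.com")))
```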

Identity and Access Management(IAM)

IAM lets administrators authorize who can take action on specific resources. An IAM policy has:
  • A "who" part
  • A "can do what" part
  • An "on which resource" part
The "who" part names the user or users. It can be defined by a Google account, a Google group, a service account, an entire G Suite domain or a Cloud Identity domain. The "can do what" part is defined by an IAM role. An IAM role is a collection of permissions.
There are three kinds of roles in Cloud IAM: primitive, predefined and custom. Primitive roles can be applied to a GCP project and they affect all resources in that project. These are the owner, editor and viewer roles. A viewer can examine a given resource but not change its state. An editor can do everything a viewer can do, plus change its state. An owner can do everything an editor can do, plus manage roles and permissions on the resource. The owner role can also set up billing. Often, companies want someone to be able to control the billing for a project without the right to change the resources in the project, and that is why we can grant someone the billing administrator role.
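The nesting of the three primitive roles can be modelled with plain sets. The permission names below are invented for this sketch and are not real IAM permission strings.

```python
# Each role is a collection of permissions; every role extends the one before it.
VIEWER = {"read"}
EDITOR = VIEWER | {"change_state"}
OWNER = EDITOR | {"manage_roles", "set_up_billing"}

ROLES = {"viewer": VIEWER, "editor": EDITOR, "owner": OWNER}


def is_allowed(role, action):
    """True if the given primitive role includes the requested action."""
    return action in ROLES[role]


print(is_allowed("viewer", "change_state"))
print(is_allowed("editor", "change_state"))
print(is_allowed("owner", "manage_roles"))
```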

IAM Roles

Predefined roles apply to a particular GCP service. For example, the InstanceAdmin role lets whoever has that role perform a certain set of actions on virtual machines: listing them, reading and changing their configurations, and starting and stopping them. Custom roles let us define an exact set of permissions, but we must manage those permissions ourselves, so some companies decide they would rather stick with the predefined roles. Custom roles can only be used at the project or organization levels; they cannot be used at the folder level. Service accounts are named with an email address, but instead of passwords they use cryptographic keys to access resources.


Be that one you always wanted to be.. 

Saturday, 23 March 2019

Hadoop : Part - 2


When to use Hadoop


  • Support for multiple frameworks: Hadoop can be integrated with multiple analytical tools, such as R and Python for analytics and visualisation, Python and Spark for real-time processing, MongoDB and HBase for NoSQL databases, Pentaho for BI, etc.
  • Data size and Data diversity
  • Lifetime data availability due to scalability and fault tolerance.

Hadoop Namenode failover process

In a High Availability cluster, two separate machines are configured as NameNodes. One of the NameNodes is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. 

In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. For a fast failover, the Standby node must also have up-to-date information about the location of blocks in the cluster. To achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state.
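The "majority of JournalNodes" rule is a simple quorum check. The helper names below are made up for this sketch.

```python
def is_durably_logged(acks, journal_node_count):
    """True when a strict majority of JournalNodes acknowledged the edit."""
    return acks > journal_node_count // 2


def tolerated_failures(journal_node_count):
    """How many JournalNodes can be down while a majority is still reachable."""
    return (journal_node_count - 1) // 2


print(is_durably_logged(acks=2, journal_node_count=3))  # majority reached
print(is_durably_logged(acks=1, journal_node_count=3))  # not a majority
print(tolerated_failures(3), tolerated_failures(5))
```

This is why JournalNodes are usually deployed in odd numbers: going from three to four nodes does not increase the number of failures that can be tolerated.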

Ways to rebalance the cluster when new datanodes are added


  • Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
  • A way with no interruption of service is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.
  • Turn off the DataNode that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes.
  • Execute the bin/start-balancer.sh command to run a balancing process that moves blocks around the cluster automatically.

Actual data storage locations for NameNode and DataNode

A comma-separated list of pathnames can be specified as dfs.datanode.data.dir for data storage on DataNodes. The dfs.namenode.name.dir parameter specifies the directories where the NameNode stores its data.

Limiting DataNode's disk usage

The dfs.datanode.du.reserved setting in $HADOOP_HOME/conf/hdfs-site.xml can be used to limit disk usage: it reserves space, in bytes per volume, for non-HDFS use.

Removing datanodes from a cluster

Removing one or two data-nodes will not lead to any data loss, because the name-node will re-replicate their blocks as soon as it detects that the nodes are dead.
Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be listed in the exclude file, and the exclude file name should be specified in the configuration parameter dfs.hosts.exclude. Specify the full hostname, ip or ip:port format in this file. Then the shell command
bin/hadoop dfsadmin -refreshNodes
should be called, which forces the name-node to re-read the exclude file and start the decommission process.
The decommission progress can be monitored on the name-node Web UI. Until all blocks are replicated the node will be in "Decommission In Progress" state. When decommission is done the state will change to "Decommissioned".

Files and block sizes

HDFS provides an API to specify the block size when creating a file, so different files can have different block sizes: FileSystem.create(path, overwrite, bufferSize, replication, blockSize, progress)

Hadoop streaming

Hadoop has a generic API, called Hadoop streaming, for writing MapReduce programs in any language that can read from standard input and write to standard output, such as Python, Ruby or Perl.
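A word count is the usual first example. The sketch below keeps the mapper and the reducer in one hypothetical script that picks its role from a command-line argument; the tab-separated "word, count" line format and the function names are choices made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1


def parse_pairs(lines):
    """Parse 'word<TAB>count' lines as printed by the mapper."""
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        yield word, int(count)


def reduce_pairs(pairs):
    """Reducer: sum the counts per word; pairs must arrive sorted by word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for word, count in map_lines(sys.stdin):
            print(f"{word}\t{count}")
    elif sys.argv[1] == "reduce":
        for word, count in reduce_pairs(parse_pairs(sys.stdin)):
            print(f"{word}\t{count}")
```

Saved as wordcount.py (a hypothetical file name), the pipeline can be tried without a cluster: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce. On a cluster the two commands would be passed to the streaming jar through its -mapper and -reducer options.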

Inter cluster data copy

Hadoop provides the DistCp (distributed copy) command, hadoop distcp, to copy data across different Hadoop clusters.


Be the best thing that ever happen to everyone 

Saturday, 16 March 2019

Hadoop : Part - 1


History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug was working at Yahoo at that time and is now Chief Architect of Cloudera. Hadoop was named after his son's toy elephant.
Hadoop
Apache Hadoop is a framework that provides various tools to store and process Big Data. It helps in analyzing Big Data and making business decisions. Hadoop is not an acronym; the expansion "High Availability Distributed Object Oriented Platform" that is sometimes quoted is a backronym.
Latest version
At the time of writing, the latest version of Hadoop is 3.1.2, released on Feb 6, 2019.
Companies using Hadoop
Cloudera, Amazon Web Services, IBM, Hortonworks, Intel, Microsoft etc
Top vendors offering Hadoop distribution
Cloudera, HortonWorks, Amazon Web Services Elastic MapReduce Hadoop Distribution, Microsoft, MapR, IBM etc
Advantages of Hadoop distributions
  • Technical Support
  • Consistent with patches, fixes and bug detection
  • Extra components for monitoring
  • Easy to install 
Modes of Hadoop
Hadoop can run in three modes:
  • Standalone- Default mode of Hadoop. It uses the local file system for input and output operations. It is much faster than the other modes and is mainly used for debugging purposes.
  • Pseudo distributed(Single Node Cluster)- In this case all daemons are running on one node and thus both Master and Slave node are the same.
  • Fully distributed(Multiple Node Cluster)- Here separate nodes are allotted as Master and Slave. The data is distributed across several nodes on Hadoop cluster.

Main components of Hadoop
There are two main components namely:
  • Storage unit– HDFS
  • Processing framework– YARN

HDFS
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data in a distributed environment. It follows master - slave architecture.
Components of HDFS
  • NameNode: The NameNode is the master node, responsible for storing the metadata of all files and directories, such as block locations and replication factors. It knows which blocks make up a file and where those blocks are located in the cluster. The NameNode uses two files for storing the metadata:
- FsImage: It keeps track of the latest checkpoint of the namespace.
- Edit log: It is the log of changes that have been made to the namespace since the last checkpoint.

  • DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.

YARN
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
Components of YARN
  • ResourceManager: It receives the processing requests and passes them to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs. It is the central authority that manages resources and schedules applications running on top of YARN.
  • NodeManager: NodeManager is installed on every DataNode and it is responsible for the execution of the task on every DataNode. It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
Hadoop daemons
Hadoop daemons can be broadly divided into three groups:

  • HDFS daemons- NameNode, DataNode, Secondary NameNode
  • YARN daemons- ResourceManager, NodeManager
  • JobHistoryServer
Secondary NameNode
It periodically merges the changes (edit log) with the FsImage (Filesystem Image), present in the NameNode. It stores the modified FsImage into persistent storage, which can be used in case of failure of NameNode.
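The merge can be thought of as replaying the edit log on top of the last snapshot. The sketch below uses a plain dictionary as a stand-in for the FsImage and a list of tuples for the edit log; both formats and the two operation names are invented for illustration.

```python
def apply_edits(fsimage, edit_log):
    """Return a new namespace snapshot with the logged edits applied.

    fsimage: dict mapping path -> metadata (a toy stand-in for the real FsImage).
    edit_log: list of (operation, path, metadata) tuples, oldest first.
    """
    namespace = dict(fsimage)  # copy so the old checkpoint stays untouched
    for operation, path, metadata in edit_log:
        if operation == "create":
            namespace[path] = metadata
        elif operation == "delete":
            namespace.pop(path, None)
    return namespace


fsimage = {"/data/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/data/b.txt", {"replication": 2}),
    ("delete", "/data/a.txt", None),
]
checkpoint = apply_edits(fsimage, edit_log)
print(checkpoint)
```

After the merge, the new snapshot becomes the latest checkpoint and the edit log can start again from empty.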

JobHistoryServer
It maintains information about MapReduce jobs after the Application Master terminates.

He Who has a Why to live for, can bear almost any How

Saturday, 2 March 2019

Google Cloud Platform(GCP) : Part- 1

Cloud Computing


Cloud computing is a way of using I.T. that has these five important traits:
  • Get computing resources on-demand and self-service
  • Access these resources over the internet from anywhere we want
  • The provider of these resources has a big pool of them and allocates them to customers out of that pool
  • Resources are elastic
  • Pay only for what we use

GCP Architectures

Virtualized data centers brought us Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings.
IaaS offerings provide raw compute, storage and network, organized in ways that are familiar from data centers.
PaaS offerings, on the other hand, bind the application code we write to libraries that give access to the infrastructure our application needs.
In the IaaS model, we pay for what we allocate. In the PaaS model, we pay for what we use. Google's popular applications, like Search, Gmail, Docs and Drive, are Software as a Service applications.

Google Network

Google's network is designed to give its users the highest possible throughput and the lowest possible latencies for their applications. When an Internet user sends traffic to a Google resource, Google responds to the user's request from an edge network location that will provide the lowest latency. Google's edge-caching network caches content close to end users to minimize latency.

GCP Regions and Zones

A zone is a deployment area for Google Cloud Platform resources. Zones are grouped into regions, which are independent geographic areas, and we can choose what regions our GCP resources are in. All the zones within a region have fast network connectivity among them. Locations within regions usually have round-trip network latencies of under five milliseconds. A zone is a single failure domain within a region. As part of building a fault-tolerant application, we can spread the resources across multiple zones in a region. That helps protect against unexpected failures. We can run resources in different regions too. Lots of GCP customers do that, both to bring their applications closer to users around the world and to protect against the loss of an entire region, say, due to a natural disaster.
A few Google Cloud Platform services support placing resources in what we call a multi-region. For example, Google Cloud Storage lets us place data within the Europe multi-region. That means it will be stored redundantly in at least two geographic locations, separated by at least 160 kilometers, within Europe.

Pricing

Google was the first major cloud provider to bill by the second, rather than rounding up to bigger units of time, for its virtual machines as a service offering. Charges for rounding can really add up for customers who are creating and running lots of virtual machines. Compute Engine offers automatically applied sustained use discounts, which are automatic discounts that we get for running a virtual machine for a significant portion of the billing month. When we run an instance for more than 25 percent of a month, Compute Engine automatically gives us a discount for every incremental minute we use it.

Open APIs

Google helps its customers avoid feeling locked in. GCP services are compatible with open source products. For example, Bigtable uses the interface of the open source database Apache HBase, which gives customers the benefit of code portability. As another example, Cloud Dataproc offers the open source big data environment Hadoop as a managed service. Google publishes key elements of its technology under open source licenses to create ecosystems that provide customers with options other than Google. For example, TensorFlow is an open source software library for machine learning developed inside Google.
Many GCP technologies provide interoperability. Kubernetes gives customers the ability to mix and match microservices running across different clouds, and Google Stackdriver lets customers monitor workloads across multiple cloud providers.

Why GCP

Google Cloud Platform lets us choose from computing, storage, big data, machine learning and application services for web, mobile, analytics and backend solutions. It's global, it's cost effective, it's open source friendly and it's designed for security. Google Cloud Platform's products and services can be broadly categorized as compute, storage, big data, machine learning, networking, and operations and tools.

We are born not to be Average, 
We are born to be Awesome.. 

Sunday, 5 August 2018

Data Analysis with Python

Python is an open-source, object-oriented programming language developed by Guido van Rossum in the late 1980s. While implementing Python, van Rossum was also reading the published scripts from “Monty Python's Flying Circus”, a BBC comedy series. He thought he needed a name that was short and slightly mysterious, so he decided to call the language Python. Python is an interpreted language. Compiled languages are translated ahead of time into machine code that the CPU can execute directly, whereas interpreted languages are translated into machine instructions at run time. At Google, Python is one of the three "official languages", alongside C++ and Java. Google even has a developer portal devoted to Python, with free classes that include exercises and lecture videos (https://developers.google.com/edu/python/). Python is also used as the configuration language for Tupperware, Facebook's container deployment system.

Installation

2 options:

  • Directly download and install Python
  • Download and install Anaconda

Development environment


  • Terminal or shell
  • IDLE(Integrated Development and Learning Environment)
  • iPython Notebook

Data structures


  • Lists: Comma-separated values in square brackets. Lists are mutable, and items need not be of the same type.
  • Strings: They are immutable. Enclosed between single ('), double (") or triple (''') quotes. Strings enclosed in triple quotes can span multiple lines.
  • Tuples: A number of values separated by commas, usually surrounded by parentheses. They are immutable. Tuples are generally faster to process than lists because of their immutable nature.
  • Dictionary: Set of Key:Value pairs enclosed in curly braces. Keys are unique.
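A short session shows the four structures side by side. The variable names and values are arbitrary.

```python
# List: mutable, written in square brackets; items may have different types.
items = [1, "two", 3.0]
items.append(4)

# String: immutable; triple quotes allow the text to span multiple lines.
text = """first line
second line"""

# Tuple: immutable sequence of comma-separated values.
point = (10, 20)

# Dictionary: key:value pairs in curly braces; keys are unique.
ages = {"alice": 30, "bob": 25}
ages["alice"] = 31  # assigning to an existing key overwrites its value

try:
    point[0] = 99  # tuples do not support item assignment
except TypeError as error:
    tuple_error = str(error)

print(items, text.splitlines(), point, ages, tuple_error, sep="\n")
```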

Python libraries for data analysis


  • Numpy: Numerical Python. It provides the n-dimensional array feature. It also includes basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with low-level languages like C, C++ and Fortran.
  • Scipy: Scientific Python. Library for discrete Fourier transforms, linear algebra, optimization and sparse matrices. It is a module for science built on Numpy.
  • Matplotlib: It is used for plotting graphs.
  • Pandas: It is used for structured data operations and manipulations. The following data structures are included in Pandas:
- Series: These are one dimensional labelled arrays.
- Dataframes: These are two dimensional data structures. It has column names and row indices.
  • Scikit Learn: It is a library for Machine Learning built on Numpy, Scipy and Matplotlib. It includes tools for Classification, Regression, Clustering and Dimensionality Reduction.
  • NLTK: Natural Language Processing Tool Kit. It is a library for Natural Language Processing.
  • Stats Models: It is a library used for statistical Modeling
  • Seaborn: It is a library based on Matplotlib for statistical data visualization.
  • Bokeh: It is used for creating interactive plots and dashboards in web browsers. It can visualize large and streaming datasets.
  • Blaze: It is built on Numpy and Pandas for streaming and distributed datasets. It has connectors to Apache Spark, MongoDB etc.
  • Scrapy: It is used for web crawling. It can be used to extract information from all pages of a website.
  • Sympy: It is a library for symbolic computation. It has the capability of formatting the result of the computations as LaTeX code.
  • Astropy: It is a package for Astronomy in Python.
  • Biopython: It is a set of tools for biological computation. 


Friendship is born at that moment when one person says to another: 'What! You too? I thought I was the only one' 😊