Saturday, 16 March 2019

Hadoop : Part - 1


History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug was working at Yahoo at that time and is now Chief Architect of Cloudera. Hadoop was named after his son's toy elephant.
Hadoop
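Apache Hadoop is a framework that provides various tools to store and process Big Data. It helps in analyzing Big Data and making business decisions. The name is not an acronym: it comes from the toy elephant mentioned above, and the expansion "High Availability Distributed Object Oriented Platform" that is sometimes quoted is a backronym.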
Latest version
The latest release of Hadoop at the time of writing is 3.1.2, released on Feb 6, 2019.
Companies using Hadoop
Cloudera, Amazon Web Services, IBM, Hortonworks, Intel, Microsoft etc
Top vendors offering Hadoop distribution
Cloudera, HortonWorks, Amazon Web Services Elastic MapReduce Hadoop Distribution, Microsoft, MapR, IBM etc
Advantages of Hadoop distributions
  • Technical Support
  • Consistent with patches, fixes and bug detection
  • Extra components for monitoring
  • Easy to install 
Modes of Hadoop
Hadoop can run in three modes:
  • Standalone- Default mode of Hadoop. It uses the local file system for input and output operations. It is much faster than the other modes and is mainly used for debugging.
  • Pseudo distributed (Single Node Cluster)- All daemons run on one node, so the Master and Slave node are the same machine.
  • Fully distributed (Multiple Node Cluster)- Separate nodes are allotted as Master and Slave. The data is distributed across several nodes of the Hadoop cluster.

Main components of Hadoop
There are two main components namely:
  • Storage unit– HDFS
  • Processing framework– YARN

HDFS
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data in a distributed environment. It follows master - slave architecture.
Components of HDFS
  • NameNode: NameNode is the master node, which is responsible for storing the metadata of all the files and directories, such as block locations, replication factors etc. It knows which blocks make up a file and where those blocks are located in the cluster. NameNode uses two files for storing the metadata, namely:
FsImage- It keeps track of the latest checkpoint of the namespace.
Edit log- It is the log of changes that have been made to the namespace since the last checkpoint.

  • DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.

YARN
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
Components of YARN
  • ResourceManager: It receives the processing requests and passes them to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs. It is the central authority that manages resources and schedules applications running on top of YARN.
  • NodeManager: NodeManager is installed on every DataNode and it is responsible for the execution of the task on every DataNode. It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
Hadoop daemons
Hadoop daemons can be broadly divided into three groups, namely:

  • HDFS daemons- NameNode, DataNode, Secondary NameNode
  • YARN daemons- ResourceManager, NodeManager
  • JobHistoryServer
Secondary NameNode
It periodically merges the changes (edit log) with the FsImage (Filesystem Image), present in the NameNode. It stores the modified FsImage into persistent storage, which can be used in case of failure of NameNode.
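
The merge described above can be pictured with a small toy model. This is only a sketch under simplifying assumptions: the namespace is a plain dictionary of path to replication factor, and the operation names (`create`, `set_replication`, `delete`) and the helper functions are hypothetical, not Hadoop's real edit log format or API.

```python
# Toy model of checkpointing: FsImage + edit log -> new FsImage.
# The operation names and data layout are hypothetical, not Hadoop's format.

def apply_edit(namespace, edit):
    """Apply one logged change to a namespace (dict of path -> replication factor)."""
    op, path, value = edit
    if op in ("create", "set_replication"):
        namespace[path] = value
    elif op == "delete":
        namespace.pop(path, None)
    return namespace


def checkpoint(fsimage, edit_log):
    """Replay the edit log on a copy of the FsImage and return the merged image."""
    merged = dict(fsimage)  # leave the old checkpoint untouched
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged


fsimage = {"/data/a.txt": 3}
edit_log = [
    ("create", "/data/b.txt", 3),
    ("set_replication", "/data/a.txt", 2),
    ("delete", "/data/b.txt", None),
    ("create", "/logs/app.log", 1),
]

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)
```

Once the merged image is saved, the edit log only needs to record changes made after that checkpoint, which keeps the log short.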

JobHistoryServer
It maintains information about MapReduce jobs after the Application Master terminates.

He Who has a Why to live for, can bear almost any How

Saturday, 2 March 2019

Google Cloud Platform(GCP) : Part- 1

Cloud Computing


Cloud computing is a way of using I.T. that has these five important traits:
  • Get computing resources on-demand and self-service
  • Access these resources over the internet from anywhere we want
  • The provider of these resources has a big pool of them and allocates them to customers out of that pool
  • Resources are elastic
  • Pay only for what we use

GCP Architectures

Virtualized data centers brought us Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings.
IaaS offerings provide raw compute, storage and network, organized in ways that are familiar from data centers.
PaaS offerings, on the other hand, bind the application code we write to libraries that give access to the infrastructure our application needs.
In the IaaS model, we pay for what we allocate. In the PaaS model, we pay for what we use. Google's popular applications like Search, Gmail, Docs and Drive are Software as a Service (SaaS) applications.

Google Network

Google's network is designed to give its users the highest possible throughput and the lowest possible latencies for their applications. When an Internet user sends traffic to a Google resource, Google responds to the user's request from an edge network location that will provide the lowest latency. Google's edge-caching network caches content close to end users to minimize latency.

GCP Regions and Zones

A zone is a deployment area for Google Cloud Platform resources. Zones are grouped into regions, which are independent geographic areas, and we can choose which regions our GCP resources are in. All the zones within a region have fast network connectivity among them. Locations within regions usually have round trip network latencies of under five milliseconds. A zone is a single failure domain within a region. As part of building a fault tolerant application, we can spread the resources across multiple zones in a region. That helps protect against unexpected failures. We can run resources in different regions too. Lots of GCP customers do that, both to bring their applications closer to users around the world and to protect against the loss of an entire region, say, due to a natural disaster.
A few Google Cloud Platform services support placing resources in what we call a Multi-Region. For example, Google Cloud Storage lets us place data within the Europe Multi-Region. That means it will be stored redundantly in at least two geographic locations, separated by at least 160 kilometers, within Europe.

Pricing

Google was the first major cloud provider to bill by the second, rather than rounding up to bigger units of time, for its virtual machines as a service offering. Charges for rounding can really add up for customers who are creating and running lots of virtual machines. Compute Engine offers automatically applied sustained use discounts, which are discounts that we get for running a virtual machine for a significant portion of the billing month. When we run an instance for more than 25 percent of a month, Compute Engine automatically gives us a discount for every incremental minute we use it.
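
To see how rounding charges add up, here is a small arithmetic sketch. The hourly rate, the one-hour rounding unit and the list of run times are made-up numbers for illustration only, not actual GCP prices or billing rules.

```python
import math

HOURLY_RATE = 0.05  # hypothetical price in dollars per VM-hour, not a real GCP rate


def cost_per_second(seconds, hourly_rate=HOURLY_RATE):
    """Bill exactly the seconds used."""
    return seconds * hourly_rate / 3600


def cost_rounded_up(seconds, hourly_rate=HOURLY_RATE, unit_seconds=3600):
    """Bill in whole units of time, rounding every run up to the next unit."""
    units = math.ceil(seconds / unit_seconds)
    return units * unit_seconds * hourly_rate / 3600


# Hypothetical workload: 300 short-lived virtual machines (run times in seconds).
runs = [90, 600, 3700] * 100

exact = sum(cost_per_second(s) for s in runs)
rounded = sum(cost_rounded_up(s) for s in runs)
print(f"per-second billing: ${exact:.2f}")
print(f"rounded up hourly:  ${rounded:.2f}")
```

The gap between the two totals grows with the number of short runs, which is why per-second billing matters most for workloads that create and destroy many virtual machines.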

Open APIs

Google helps its customers avoid feeling locked in. GCP services are compatible with open source products. For example, Bigtable uses the interface of the open source database Apache HBase, which gives customers the benefit of code portability. Another example: Cloud Dataproc offers the open source big data environment Hadoop as a managed service. Google publishes key elements of technology using open source licenses to create ecosystems that provide customers with options other than Google. For example, TensorFlow is an open source software library for machine learning developed inside Google.
Many GCP technologies provide interoperability. Kubernetes gives customers the ability to mix and match microservices running across different clouds, and Google Stackdriver lets customers monitor workloads across multiple cloud providers.

Why GCP

Google Cloud Platform lets us choose from computing, storage, big data, machine learning and application services for web, mobile, analytics and backend solutions. It's global, it's cost effective, it's open source friendly and it's designed for security. Google Cloud Platform's products and services can be broadly categorized as compute, storage, big data, machine learning, networking, and operations and tools.

We are born not to be Average, 
We are born to be Awesome.. 

Sunday, 5 August 2018

Data Analysis with Python

Python is an open-source, object-oriented programming language developed by Guido van Rossum in the late 1980s. While implementing Python, van Rossum was also reading the published scripts from “Monty Python's Flying Circus”, a BBC comedy series. He thought he needed a name that was short and slightly mysterious, so he decided to call the language Python. Python is an interpreted language: instead of being compiled ahead of time into an executable of machine instructions, the source code is translated and executed at run time. At Google, Python is one of the 3 "official languages" alongside C++ and Java. They even have a developer portal devoted to Python, with free classes including exercises and lecture videos (https://developers.google.com/edu/python/). Python is also used as the configuration language for Tupperware, Facebook's container deployment system.

Installation

2 options:

  • Directly download and install Python
  • Download and install Anaconda

Development environment


  • Terminal or shell
  • IDLE(Integrated Development and Learning Environment)
  • iPython Notebook

Data structures


  • Lists: List of comma separated values in square brackets. Lists are mutable, and the items need not be of the same type.
  • Strings: They are immutable. Enclosed between single ('), double (") or triple (''') quotes. Strings enclosed in triple quotes can span multiple lines.
  • Tuples: A number of values separated by commas. They are surrounded by parentheses. They are immutable. Tuples are generally a little faster to process than lists because of their immutable nature.
  • Dictionary: Set of Key:Value pairs enclosed in curly braces. Keys are unique.
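
A short illustration of the four data structures. The variable names and values are arbitrary examples; note that Python itself does not enforce a single element type in a list, and that dictionaries are written with curly braces.

```python
# List: mutable, written with square brackets; items may be of different types.
items = [1, "two", 3.0]
items.append(4)

# String: immutable; triple quotes allow the text to span several lines.
text = """first line
second line"""

# Tuple: immutable, written with parentheses.
point = (3, 4)
try:
    point[0] = 10  # tuples do not support item assignment
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

# Dictionary: key:value pairs in curly braces; keys are unique,
# so assigning to an existing key overwrites its value.
ages = {"alice": 30, "bob": 25}
ages["alice"] = 31

print(items, text.splitlines(), point, ages)
```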

Python libraries for data analysis


  • Numpy: Numerical Python. It provides the n-dimensional array feature. It also includes basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with low level languages like C, C++ and Fortran.
  • Scipy: Scientific Python. Library for discrete Fourier transforms, linear algebra, optimization and sparse matrices. It is a module for science built on Numpy.
  • Matplotlib: It is used for plotting graphs.
  • Pandas: It is used for structured data operations and manipulations. The following data structures are included in Pandas:
- Series: These are one dimensional labelled arrays.
- Dataframes: These are two dimensional data structures. It has column names and row indices.
  • Scikit Learn: It is a library for Machine Learning built on Numpy, Scipy and Matplotlib. It includes tools for Classification, Regression, Clustering and Dimensionality Reduction.
  • NLTK: Natural Language Toolkit. It is a library for Natural Language Processing.
  • Statsmodels: It is a library used for statistical modeling.
  • Seaborn: It is a library based on Matplotlib for statistical data visualization.
  • Bokeh: It is used for creating interactive plots and dashboards in web browsers. It can visualize large and streaming datasets.
  • Blaze: It is built on Numpy and Pandas for streaming and distributed datasets. It has connectors to Apache Spark, MongoDB etc.
  • Scrapy: It is used for web crawling. It can be used to extract information from all pages of a website.
  • Sympy: It is a library for symbolic computation. It has the capability of formatting the result of the computations as LaTeX code.
  • Astropy: It is a package for Astronomy in Python.
  • Biopython: It is a set of tools for biological computation. 


Friendship is born at that moment when one person says to another: 'What! You too? I thought I was the only one' 😊


Sunday, 18 March 2018

Smart Digital Store- DigiShopiGo

Tom is planning to buy a smartphone. He searches online but finds different brands with the same specifications. He even puts up a Facebook post asking for suggestions. By chance he notices the 'DigiShopiGo' web store. 'DigiShopiGo' is a world famous retail chain offering a wide variety of products. Tom decides to search for the smartphone in the web store.

The first step to search for a product in 'DigiShopiGo' is to register an email-id. After registering, Tom sees a variety of models with the same specifications, including ones his friends had bought. He can also see reviews of each product. Tom purchases the product and gets notifications regarding its delivery. To his surprise, a drone delivers the purchased mobile in 30 minutes. He is very satisfied with his purchasing experience and shares his opinion on social media.

'DigiShopiGo' has an advanced analytics wing that keeps track of the social media activities of its customers. Identifying Tom's opinion as positive feedback, it starts sending notifications to Tom regarding the sale and availability of mobile accessories.

Some time later Tom loses his mobile charger. He searches for an electronics shop online and offline. Suddenly he gets a notification regarding a 'DigiShopiGo' outlet just nearby. 'DigiShopiGo' uses the shopper's location data to showcase the nearest retail location. On entering the store, Tom is astonished to see a map showing the exact location of the item he needs to buy. Tom also gets instant notifications regarding the estimated wait time at the store for billing. As Tom walks, he also sees smart digital shelves that give him a personalized experience by highlighting products of his interest. He instantly shares the experience on social media. So being digital means being social and accessible.

Be where the world is going!

Sunday, 28 January 2018

FAIR releases Detectron

Facebook’s AI research team (FAIR) has been working on the problem of object detection by using deep learning to give computers the ability to reach conclusions about what objects are present in a scene. The company’s object detection system is called Detectron. The Detectron project was started in July 2016 with the goal of creating a fast and flexible object detection system. It implements state-of-the-art object detection algorithms. It is written in Python and powered by the Caffe2 deep learning framework. The algorithms examine video input and are able to make guesses about what discrete objects comprise the scene.

At FAIR, Detectron has enabled numerous research projects, including: 

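  • Feature Pyramid Networks for Object Detection: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. Recent deep learning object detectors had avoided pyramid representations, in part because they are compute and memory intensive; Feature Pyramid Networks build the pyramid from the network's own multi-scale feature hierarchy at marginal extra cost.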
  • Mask R-CNN: It is a general framework for object instance segmentation. In object instance segmentation, given an image, the goal is to label each pixel according to its object class as well as its object instance. Instance segmentation is closely related to two important tasks in computer vision, namely semantic segmentation and object detection. The goal of semantic segmentation is to label each pixel according to its object class. However, semantic segmentation does not differentiate between two different object instances of the same class. For example, if there are two persons in an image, semantic segmentation will assign the same label to pixels belonging to either of these two persons. The goal of object detection is to predict the bounding box and the object class of each object instance in the image. However, object detection does not provide per-pixel labeling of the object instance. Compared with semantic segmentation and object detection, object instance segmentation is strictly more challenging, since it aims to identify object instance as well as provide per-pixel labeling of each object instance.
  • Detecting and Recognizing Human-Object Interactions: To understand the visual world, a machine must not only recognize individual object instances but also how they interact. The human-object interaction is detected and represented as triplets <human, verb, object> in photos. For example: <person, reads, book>
  • Focal Loss for Dense Object Detection: The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors. The focal loss addresses this by down-weighting the loss of easy, well-classified examples, so that training focuses on the hard ones. An object detector named RetinaNet was designed to evaluate this loss. RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
  • Non-local Neural Networks: Non-local means is an algorithm in image processing for image denoising. Unlike "local mean" filters, which take the mean value of a group of pixels surrounding a target pixel to smooth the image, non-local means filtering takes a mean of all pixels in the image, weighted by how similar these pixels are to the target pixel. This results in much greater post-filtering clarity, and less loss of detail in the image compared with local mean algorithms. Inspired by the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures.
  • Learning to Segment Every Thing: Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-annotated classes. A new partially supervised training paradigm is proposed, together with a novel weight transfer function, that enables training instance segmentation models over a large set of categories for which all have box annotations, but only a small fraction have mask annotations.
  • Data Distillation: Omni-supervised learning is a special area of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Data distillation is a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations.
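
As an illustration of the focal loss mentioned in the list above, here is a minimal sketch of the formula from the paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the probability the model assigns to the true class. The function names are my own, `gamma=2.0` follows the default reported in the paper, and `alpha_t=1.0` is a simplification for readability; this is not Detectron's implementation.

```python
import math


def cross_entropy(p_t):
    """Standard cross entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss: cross entropy scaled by (1 - p_t) ** gamma.

    Well-classified examples (p_t close to 1) are down-weighted strongly,
    so hard examples dominate the total loss.
    """
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.9, 0.5, 0.1):
    print(f"p_t={p_t}: CE={cross_entropy(p_t):.4f}  FL={focal_loss(p_t):.4f}")
```

With `gamma=0` the focal loss reduces to plain cross entropy; larger values of `gamma` shrink the contribution of easy examples further.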

The goal of Detectron is to provide a high-quality, high-performance codebase for object detection research. It is designed to be flexible in order to support rapid implementation and evaluation of novel research. Detectron includes implementations of the following object detection algorithms:

  • Mask R-CNN
  • RetinaNet
  • Faster R-CNN
  • RPN
  • Fast R-CNN
  • R-FCN


 From augmented reality to various computer vision tasks, Detectron has a wide variety of uses. One of the many things that this new platform can do is object masking. Object masking takes object detection a step further: instead of just drawing a bounding box around an object, it can draw a complex polygon that follows the object's outline. Detectron is available under the Apache 2.0 licence on GitHub. The company says it is also releasing extensive performance baselines for more than 70 pre-trained models that are available to download from its model zoo on GitHub. Once a model is trained, it can be deployed on the cloud and even on mobile devices.

References


  1. https://www.techleer.com/articles/469-facebook-announces-open-sourcing-of-detectron-a-real-time-object-detection/
  2. https://github.com/facebookresearch/Detectron
  3. https://arxiv.org/


Success is walking from failure to failure with no loss of enthusiasm..


Sunday, 10 December 2017

K-means clustering

It is the process of grouping documents such that documents in the same cluster are similar and documents in different clusters are dissimilar. The vector space model is used to represent the documents.
Algorithm

  1. Choose the value of k
  2. Randomly choose k objects to form the centroids of the k clusters
  3. Repeat until there is no change in the location of the centroids or in the objects assigned to each cluster:
       • Find the distance of each object to every cluster center and assign it to the cluster with the minimum distance
       • Calculate the mean of each cluster group to compute the new cluster centers

Let k=3 and the initial cluster seeds be d2, d5 and d7.
Calculating the Euclidean distance between d1 and d2:

The clusters are:
d2, d1, d6, d9, d10
d5, d8
d7, d3, d4
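
The algorithm above can be sketched in plain Python as follows. This is a minimal illustration, not the exact computation behind the clusters listed above: the function names, the `seeds` parameter for passing initial centroids and the sample points are all assumptions made for the example, and each document is assumed to be already represented as a numeric vector.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seeds=None, rng_seed=0, max_iter=100):
    """Return (assignment, centroids) for k-means on a list of tuples."""
    # Step 2: use the given seeds, or pick k objects at random.
    centroids = list(seeds) if seeds else random.Random(rng_seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3a: assign every object to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the centroids are stable
        assignment = new_assignment
        # Step 3b: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignment, centroids


# Hypothetical 2-D vectors forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, k=2, seeds=[(0, 0), (10, 10)])
print(assignment, centroids)
```

Passing explicit seeds mirrors the worked example, where the initial centroids are chosen by hand; leaving `seeds` empty falls back to a random choice as in step 2.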


If you want something and you get something else, never be afraid. One day things will surely work your way!!

Sunday, 3 December 2017

Developing Devices that can See

Google AIY Vision Kit

Google introduced the AIY Voice Kit back in May, and now the company has launched the AIY Vision Kit, which has on-device neural network acceleration for Raspberry Pi. Unlike the Voice Kit, the Vision Kit is designed to run all the machine learning locally on the device rather than talk to the cloud. While it was possible to run TensorFlow locally on the Raspberry Pi with the Voice Kit, that kit was far more suited to using Google’s Assistant API or Cloud Speech API to do voice recognition. The Vision Kit, however, is designed from the ground up to do all its image processing locally. The Vision Kit includes a new circuit board and computer vision software that can be paired with a Raspberry Pi computer and camera.
In addition to the Vision Kit, users will need a Raspberry Pi Zero W, a Raspberry Pi Camera, an SD card and a power supply, which must be purchased separately. The Kit includes a cardboard outer shell, the VisionBonnet circuit board, an RGB arcade-style button, a piezo speaker, a macro/wide lens kit, a tripod mounting nut and other connecting components.
The main component of the Vision Kit is the VisionBonnet, which features the Intel Movidius MA2450, a low-power vision processing unit capable of running neural network models on-device. The software includes three models:

  • Model to recognize common objects
  • Model to recognize faces and their expressions
  • Person, cat and dog detector

Google has also included a tool to compile models for Vision Kit, and users can train their own models with Google’s TensorFlow machine learning software. The AIY Vision kit costs $44.99 and will ship from December 31st through Micro Center. This first batch is a limited run of just 2,000 units and is available in the US only.

AWS DeepLens

Amazon’s DeepLens device, introduced at AWS re:Invent, is aimed at software developers and data scientists using machine learning. Amazon has packed a lot of power into it: a 4 megapixel camera that can capture 1080p video, a 2D microphone array, and even an Intel Atom processor. It is intended to sit connected to the mains and be used as a platform and a tool. It’s a finished product, with a $250 price. It will be shipped only after April 2018. DeepLens uses Intel-optimized deep learning software tools and libraries (including the Intel Compute Library for Deep Neural Networks, Intel clDNN) to run real-time computer vision models directly on the device for reduced cost and real-time responsiveness. It supports major machine learning frameworks like Google’s TensorFlow, Facebook’s Caffe2, PyTorch, and Apache MXNet. DeepLens will be tightly integrated with other cloud and AI services sold by AWS.

The Amazon kit is aimed at developers looking to build and train deep learning models in the real world. The Google kit is aimed at makers looking to build projects, or even products. The introduction of TensorFlow Lite and the Google AIY Vision Kit can be regarded as part of a recent trend of moving computation to the device rather than the cloud.

Reference:

https://aiyprojects.withgoogle.com/
https://aws.amazon.com/deeplens/
https://aws.amazon.com/blogs/aws/deeplens/


Adopting the right attitude can convert a negative stress into a positive one