Saturday, 20 March 2021

Setup of Spark Scala Program for WordCount in Windows10

Install Spark:

Download the latest version of Spark from: https://www.apache.org/dyn/closer.lua/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

Extract the archive to a directory (e.g., C:/Program Files/)

Install Scala:

Spark 3.1.1 is compatible with Scala 2.12 (e.g., Scala 2.12.10)

Download the Windows binaries from https://www.scala-lang.org/download/2.12.10.html

Set up of Environment Variables:

SPARK_HOME: c:\progra~1\spark\spark-3.1.1-bin-hadoop3.2

SCALA_HOME: the directory where Scala 2.12.10 is installed (for example, C:\Program Files (x86)\scala)

Path: %Path%;%SCALA_HOME%\bin;%SPARK_HOME%\bin;

Download the sample code from: https://github.com/anjuprasannan/WordCountExampleSpark

Configuring Scala project in IntelliJ: https://docs.scala-lang.org/getting-started/intellij-track/getting-started-with-scala-in-intellij.html

Add Maven support by following the steps at: https://www.jetbrains.com/help/idea/convert-a-regular-project-into-a-maven-project.html

Modify the pom.xml file as per the Git repository.

Build the project as: mvn clean install

Edit the input and output directories in https://github.com/anjuprasannan/WordCountExampleSpark/blob/main/src/main/scala/WordCount.scala [Note that the output location should be a non-existent directory.]
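
For orientation, a minimal Spark word count in Scala has roughly the following shape. This is only a sketch: the object name, input path, and output path are placeholders, so align them with the actual code in the repository and with your local directories.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local run inside IntelliJ; on a cluster you would drop .master("local[*]")
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    val input = "C:/data/input.txt"        // placeholder: your input file
    val output = "C:/data/wordcount_out"   // placeholder: must not exist yet

    spark.sparkContext.textFile(input)
      .flatMap(_.split("\\s+"))            // split each line into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                  // sum the counts per word
      .saveAsTextFile(output)

    spark.stop()
  }
}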

Execute the application: right-click WordCount -> Run 'WordCount'

You should then see the output directory created with the results.



"A life spent making mistakes is not only more honorable, but more useful than a life spent doing nothing."


Wednesday, 17 March 2021

Setup of Map Reduce Program for WordCount in Windows10

Java Download: https://www.oracle.com/in/java/technologies/javase/javase-jdk8-downloads.html

Maven Download: https://maven.apache.org/download.cgi

Maven installation: https://maven.apache.org/install.html

Eclipse download and installation: https://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/2020-12/R/eclipse-java-2020-12-R-win32-x86_64.zip

Hadoop Download:https://hadoop.apache.org/release/3.2.1.html

winutils: https://github.com/cdarlint/winutils/blob/master/hadoop-3.2.1/bin/winutils.exe. Download it and copy it to the bin folder under the Hadoop installation directory.

System Variables setup:

JAVA_HOME: C:\Program Files\Java\jdk1.8.0_281

HADOOP_HOME: C:\Program Files\hadoop\hadoop-3.2.1

Path: %PATH%;%JAVA_HOME%\bin;C:\Program Files\apache-maven-3.6.3\bin;%HADOOP_HOME%\sbin;

Map Reduce Code Setup:

Map Reduce Code: https://github.com/anjuprasannan/MapReduceExample

Check out the code from Git:

git clone https://github.com/anjuprasannan/MapReduceExample.git

Import the project to Eclipse

Edit the input, output, and Hadoop home directory locations in "/MapReduceExample/src/com/anjus/mapreduceexample/WordCount.java" [Note that the output location should be a non-existent directory.]

Build the project:

Right click project -> Maven Clean

Right click project -> Maven Install

Job Execution:

Right Click on "/MapReduceExample/src/com/anjus/mapreduceexample/WordCount.java" -> Run As -> Java Application



"Don't be afraid to give up the good to go for the great."





Monday, 18 January 2021

Particle Swarm Optimization

Artificial intelligence (AI) is the intelligence exhibited by machines. It is defined as “the study and design of intelligent agents”, where an intelligent agent represents a system that perceives its environment and takes actions that maximize its chance of success. AI research is highly technical and specialized and is deeply divided into subfields that often fail to communicate with each other. Currently popular approaches to AI include traditional statistical methods, traditional symbolic AI, and computational intelligence (CI). CI is a fairly new research area. It is a set of nature-inspired computational methodologies and approaches for addressing complex real-world problems for which traditional approaches are ineffective or infeasible. CI includes artificial neural networks (ANN), fuzzy logic, and evolutionary computation (EC).

Swarm intelligence (SI) is a part of EC. It researches the collective behavior of decentralized, self-organized systems, natural or artificial. Typical SI systems consist of a population of simple agents or boids interacting locally with one another and with their environment. The inspiration often comes from nature, especially biological systems. The agents in a SI system follow very simple rules. There is no centralized control structure dictating how individual agents should behave. The agents’ real behaviors are local, and to a certain degree random; however, interactions between such agents lead to the emergence of “intelligent” global behavior, which is unknown to the individual agents. Well-known examples of SI include ant colonies, bird flocking, animal herding, bacterial growth, and fish schooling.

Self-organization is a key feature of an SI system. It is a process where global order or coordination arises out of the local interactions between the components of an initially disordered system. This process is spontaneous; that is, it is not controlled by any agent inside or outside of the system. The self-organization in swarms is interpreted through three basic ingredients, as follows.

(1) Strong dynamical nonlinearity (often involving positive and negative feedback): positive feedback helps promote the creation of convenient structures, while negative feedback counterbalances positive feedback and helps to stabilize the collective pattern.

(2) Balance of exploitation and exploration: an SI system must strike a suitable balance between exploiting known good solutions and exploring new regions of the search space.

(3) Multiple interactions: agents in the swarm use information coming from neighbor agents so that the information spreads throughout the network.

Particle Swarm Optimization (PSO) is a technique used to explore the search space of a given problem to find the settings or parameters that maximize a particular objective. In computational science, it is a method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. PSO is a population-based technique for stochastic search in a multidimensional space, and it has been employed successfully for a variety of optimization problems, including many multifaceted problems where other popular methods like steepest descent, gradient descent, conjugate gradient, the Newton method, etc. do not give satisfactory results. The technique was first described by James Kennedy and Russell C. Eberhart in 1995. The algorithm is inspired by the social behavior of bird flocking and fish schooling. Suppose a group of birds is searching for food in an area and only one piece of food is available. The birds do not have any knowledge about the location of the food, but they know how far the food is from their present location. So the best strategy to locate the food is to follow the bird nearest to the food.

A flying bird has a position and a velocity at any time t. In search of food, the bird changes its position by adjusting its velocity. The velocity changes based on its past experience and also on the feedback received from neighbouring birds. This searching process can be artificially simulated for solving non-linear optimization problems. PSO is therefore a population-based stochastic optimization technique inspired by the social behaviour of bird flocking or fish schooling. In this algorithm each solution is considered a bird, called a particle. All the particles have a fitness value, which is calculated using the objective function. All the particles preserve their individual best performance, and they also know the best performance of their group. They adjust their velocity considering their own best performance and also the best performance of the best particle.

The usual aim of the particle swarm optimization (PSO) algorithm is to solve an unconstrained minimization problem: find x* such that f(x*) <= f(x) for all d-dimensional real vectors x. The objective function f: R^d -> R is called the fitness function. Each particle i in the swarm P has a neighborhood Ni (a subset of P). The structure of the neighborhoods is called the swarm topology, which can be represented by a graph. Usual topologies are the fully connected topology and the circle topology.

Consider, for example, the initial state of a four-particle PSO algorithm seeking the global maximum in a one-dimensional search space. The search space is composed of all the possible solutions. The PSO algorithm has no knowledge of the underlying objective function, and thus has no way of knowing if any of the candidate solutions are near to or far away from a local or global maximum. The PSO algorithm simply uses the objective function to evaluate its candidate solutions, and operates upon the resultant fitness values. Each particle maintains its position, composed of the candidate solution and its evaluated fitness, and its velocity. Additionally, it remembers the best fitness value it has achieved thus far during the operation of the algorithm, referred to as the individual best fitness, and the candidate solution that achieved this fitness, referred to as the individual best position or individual best candidate solution. Finally, the PSO algorithm maintains the best fitness value achieved among all particles in the swarm, called the global best fitness, and the candidate solution that achieved this fitness, called the global best position or global best candidate solution.

The PSO algorithm consists of just three steps, which are repeated until a stopping condition is met:

1. Evaluate the fitness of each particle

2. Update individual and global best fitnesses and positions

3. Update velocity and position of each particle

The first two steps are fairly trivial. Fitness evaluation is conducted by supplying the candidate solution to the objective function. Individual and global best fitnesses and positions are updated by comparing the newly evaluated fitnesses against the previous individual and global best fitnesses, and replacing the best fitnesses and positions as necessary. The velocity and position update step is responsible for the optimization ability of the PSO algorithm. Once the velocity for each particle is calculated, each particle’s position is updated by applying the new velocity to the particle’s previous position. This process is repeated until some stopping condition is met. Some common stopping conditions include: a preset number of iterations of the PSO algorithm, a number of iterations since the last update of the global best candidate solution, or a predefined target fitness value.
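
Although the post does not write the formula out, the update rule most commonly used (the inertia-weight form of standard PSO; the sketch after the step list below follows the same rule) is, for particle i and dimension d:

$$v_{i,d} \leftarrow w\,v_{i,d} + c_1 r_1 \,(pbest_{i,d} - x_{i,d}) + c_2 r_2 \,(gbest_d - x_{i,d}), \qquad x_{i,d} \leftarrow x_{i,d} + v_{i,d}$$

where w is the inertia weight, c1 and c2 are the cognitive and social acceleration coefficients, and r1 and r2 are uniform random numbers in [0, 1].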

The same can be described in detailed steps as follows:

Step: 1 ==> Initialize particles

Step: 2 ==> Evaluate the fitness of each particle

Step: 3 ==> Modify velocities based on previous best and global best positions

Step: 4 ==> Check the termination criteria; if they are not satisfied, go to Step: 2.

Step: 5 ==> Stop
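
These steps can be made concrete with a small, self-contained Scala sketch. It is illustrative only: the sphere fitness function, swarm size, iteration count, and coefficient values below are arbitrary choices for the example, not anything prescribed by the text above.

import scala.util.Random

object SimplePSO {
  // Fitness to minimize: the sphere function (a placeholder objective)
  def fitness(x: Array[Double]): Double = x.map(v => v * v).sum

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    val dim = 2
    val swarmSize = 20
    val iterations = 100
    val w = 0.7; val c1 = 1.5; val c2 = 1.5   // inertia, cognitive, social coefficients

    // Initialise positions randomly in [-5, 5] and velocities at zero
    val pos = Array.fill(swarmSize, dim)(rnd.nextDouble() * 10 - 5)
    val vel = Array.fill(swarmSize, dim)(0.0)
    val pBest = pos.map(_.clone())              // individual best positions
    val pBestFit = pBest.map(fitness)           // individual best fitness values
    var gBestFit = pBestFit.min
    var gBest = pBest(pBestFit.indexOf(gBestFit)).clone()   // global best position

    for (_ <- 1 to iterations; i <- 0 until swarmSize) {
      // Step 3: update the velocity and position of each particle
      for (d <- 0 until dim) {
        val r1 = rnd.nextDouble(); val r2 = rnd.nextDouble()
        vel(i)(d) = w * vel(i)(d) +
          c1 * r1 * (pBest(i)(d) - pos(i)(d)) +
          c2 * r2 * (gBest(d) - pos(i)(d))
        pos(i)(d) += vel(i)(d)
      }
      // Steps 1-2: evaluate fitness and update individual and global bests
      val f = fitness(pos(i))
      if (f < pBestFit(i)) { pBestFit(i) = f; pBest(i) = pos(i).clone() }
      if (f < gBestFit) { gBestFit = f; gBest = pos(i).clone() }
    }

    println(s"Best fitness $gBestFit at position " + gBest.mkString("(", ", ", ")"))
  }
}

Running the object should print a best position close to the origin, where the sphere function attains its minimum.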

Unlike Genetic Algorithms(GA), PSOs do not change the population from generation to generation, but keep the same population, iteratively updating the positions of the members of the population (i.e., particles). PSOs have no operators of “mutation”, “recombination”, and no notion of the “survival of the fittest”. On the other hand, similarly to GAs, an important element of PSOs is that the members of the population “interact”, or “influence” each other.

PSO has several advantages, including fast convergence, few setting parameters, and simple and easy implementation; hence, it can be used to solve nonlinear, non-differentiable, and multipeak optimization problems, particularly in science and engineering fields. As a powerful optimization technique, PSO has been extensively applied in different geotechnical engineering aspects such as slope stability analysis, pile and foundation engineering, rock and soil mechanics, and tunneling and underground space design. The fitness function can be non-differentiable (only values of the fitness function are used). The method can be applied to optimization problems of large dimensions, often producing quality solutions more rapidly than alternative methods.

The disadvantages of particle swarm optimization (PSO) algorithm are that it is easy to fall into local optimum in high-dimensional space and has a low convergence rate in the iterative process. There is no general convergence theory applicable to practical, multidimensional problems. For satisfactory results, tuning of input parameters and experimenting with various versions of the PSO method is sometimes necessary. Stochastic variability of the PSO results is very high for some problems and some values of the parameters. Also, some versions of the PSO method depend on the choice of the coordinate system. 

To address the above-mentioned problems, many solutions exist; they can be divided into the following three types.

(i) Major modifications, including quantum-behaved PSO, bare-bones PSO, chaotic PSO, fuzzy PSO, PSOTVAC, OPSO, SPSO, and topology.

(ii) Minor modifications, including constriction coefficient, velocity clamping, trap detection, adaptive parameters, fitness scaling, surrogate modeling, cooperative mechanisms, boundary shifting, position resetting, entropy maps, ecological behavior, jumping-out strategies, preference strategies, neighborhood learning, and local search.

(iii) Hybridization: PSO hybridized with GA, SA, TS, AIS, ACO, HS, ABC, DE, and so forth.

The modifications of PSO are: QPSO(quantum-behaved PSO), BBPSO(bare-bones PSO), CPSO(chaotic PSO), FPSO(fuzzy PSO), AFPSO(adaptive FPSO), IFPSO(improved FPSO), PSO with time-varying acceleration coefficients(PSOTVAC), OPSO(opposition-based PSO) and SPSO(standard PSO).

The application categories of PSO are “electrical and electronic engineering,” “automation control systems,” “communication theory,” “operations research,” “mechanical engineering,” “fuel and energy,” “medicine,” “chemistry,” “biology”.

Reference: Information collected from various sources on the Internet


If you are always trying to be normal you will never know how amazing you can be!!

Monday, 9 March 2020

DataOps

Way to DataOps

DataOps focuses on the end-to-end delivery of data. In the digital era, companies need to harness their data to derive competitive advantage. In addition, companies across all industries need to comply with new data privacy regulations. The need for DataOps can be summarised as follows:
  • More data is available than ever before
  • More users want access to more data in more combinations

DataOps

When development and operations don’t work in concert, it becomes hard to ship and maintain quality software at speed. This led to the need for DevOps.
DataOps is similar to DevOps, but centered around the strategic use of data, as opposed to shipping software. DataOps is an automated, process-oriented methodology used by Big Data teams to improve the quality and reduce the cycle time of data analytics. It applies to the entire data life cycle, from data preparation to reporting.
It includes automating the different stages of the workflow, including BI, Data Science, and Analytics. DataOps speeds up the production of applications running on Big Data processing frameworks.

Components 

DataOps include the following components:

  • Data Engineering
  • Data Integration
  • Data security
  • Data Quality

DevOps and DataOps

  • DevOps is the collaboration between Developers, Operations, and QA Engineers across the entire application delivery pipeline, from design and coding to testing and production support, while DataOps is a data management method that emphasizes communication, collaboration, integration, and automation of processes between Data Engineers, Data Scientists, and other data professionals.
  • DevOps' mission is to enable developers and managers to handle modern web-based application development and deployment, while DataOps enables data professionals to optimize modern data storage and analytics.
  • DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating testing and deployment, while DataOps tries to bring the same improvements to Data Analytics.

Steps to implement DataOps

The following are the 7 steps to implement DataOps:
  1. Add Data and logic tests (a minimal sketch of such a test follows this list)
  2. Use a version control system 
  3. Branch and Merge Codebase
  4. Use multiple environments 
  5. Reuse and containerize
  6. Parameterize processing 
  7. Orchestrate data pipelines 
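
As an illustration of step 1, a data or logic test is just an automated check that runs as part of the pipeline and fails loudly when an assumption is violated. Below is a minimal Scala/Spark sketch; the dataset path, column name, and rules are hypothetical and only stand in for whatever your pipeline actually produces.

import org.apache.spark.sql.SparkSession

object OrderDataTests {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OrderDataTests")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: a parquet dataset produced by an upstream job
    val orders = spark.read.parquet("C:/data/orders")

    // Data test: the dataset must not be empty
    assert(orders.count() > 0, "orders dataset is empty")

    // Logic test: business rule - no order may have a negative amount
    val badRows = orders.filter("amount < 0").count()
    assert(badRows == 0, s"$badRows orders have a negative amount")

    spark.stop()
  }
}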

The above details are based on the learnings gathered from different Internet sources. 



Keep going, because you didn't come this far to come only this far.. 

Sunday, 1 March 2020

DevOps

DevOps is a Software Engineering practice that aims at unifying Software Development and Operation. As the name implies, it is a combination of Development and Operations. The main phases in each of these can be described as follows:
  • Dev: Plan - - > Create - - > Verify - - > Package
  • Ops: Release - - > Configure - - > Monitor

DevOps Culture 

DevOps is often described as a culture. Hence it consists of different aspects, such as:
  • Engineer Empowerment: It gives engineers more responsibility over the typical application life cycle, from development and testing through deployment, monitoring, and being on call.
  • Test Driven Development: This is the practice of writing tests before writing code. It helps increase the quality of the service and gives developers more confidence for faster and more frequent code releases.
  • Automation: This involves automating everything that can be automated, including test automation, infrastructure automation, deployment automation, etc.
  • Monitoring: This is the process of building monitoring alerts and monitoring the applications.

Challenges in DevOps

Dev Challenges
  • Waiting time for code deployment. DevOps solution: Continuous Integration ensures quick deployment of code, faster testing, and speedy feedback.
  • Pressure of work on old code. DevOps solution: Since there is no waiting time to deploy the code, the developer can focus on building the current code.

Ops Challenges
  • Difficult to maintain uptime of the production environment. DevOps solution: Containerization or virtualization provides a simulated environment to run the software containers and also offers great reliability for application uptime.
  • Tools to automate infrastructure management are not effective. DevOps solution: Configuration management helps to organize and execute configuration plans, consistently provision the system, and proactively manage the infrastructure.
  • The number of servers to be monitored increases, making it difficult to diagnose issues. DevOps solution: Continuous monitoring and a feedback system are established through DevOps, so effective administration is assured.

Periodic Table of DevOps Tools



Popular DevOps Tools

Some of the most popular DevOps tools are:
  • Git: Git is an open source, distributed, and the most popular software versioning system. Being distributed, every developer has a full local copy of the repository, and code can be cloned from a main repository simultaneously by various clients or developers.
  • Maven: Maven is a build automation tool. It automates the software build process and dependency resolution. A Maven project is configured using a project object model, or pom.xml, file.
  • Ansible: Ansible is an open source application which is used for automated software provisioning, configuration management and application deployment. Ansible helps in controlling an automated cluster environment consisting of many machines.
  • Puppet: Puppet is an open source software configuration management and automated provisioning tool. It is an alternative to Ansible and provides better control over client machines. Puppet comes with a GUI, which can make it easier to use than Ansible.
  • Docker: Docker is a containerization technology. A container packages an application together with all of its dependencies, and these containers can be deployed on any machine without caring about the underlying host details.
  • Jenkins: Jenkins is an open source automation server written in java. Jenkins is used in creating continuous delivery pipelines.
  • Nagios: Nagios is used for continuous monitoring of infrastructure. Nagios helps in monitoring servers, applications, and networks. It provides a graphical interface to check various details like memory utilisation, fan speed, routing tables of switches, or the state of an SQL server.
  • Selenium: This is an open source automation testing framework used for automating the testing of web applications. Selenium is not a single tool but a suite of tools. There are four components of Selenium – Selenium IDE, RC, WebDriver, and Grid. Selenium is used to repeatedly execute testcases for applications without manual intervention and generate reports.
  • Chef: Chef is a configuration management tool. Chef is used to manage configuration like creating or removing a user, adding SSH key to a user present on multiple nodes, installing or removing a service, etc.
  • Kubernetes: Kubernetes is an open source container orchestration tool originally developed by Google. It is used for continuous deployment and auto scaling of container clusters, and it increases fault tolerance and load balancing in a container cluster.

References 

https://xebialabs.com/periodic-table-of-devops-tools/

Can DevOps be incorporated into Data Management?
More details regarding the same will be discussed in the next post.

Also, thank you to one of my former colleagues for inspiring me to write a post on this hot topic.



She woke up every morning with the option of being anyone she wished, how beautiful it was that she always chose herself... 

Saturday, 11 January 2020

Quantum Computing

Classical computers are composed of registers and memory. The collective contents of these are often referred to as the state. Instructions for a classical computer act based on this state, which consists of long strings of bits, each encoding either a zero or a one.
The world is running out of computing capacity. Moore's law observes that the number of transistors on a microprocessor doubles roughly every 18 months, and transistors are approaching atomic scale. So the next logical step is to create quantum computers, which harness the power of atoms and molecules to perform processing tasks.
The word quantum was derived from Latin, meaning 'how great' or 'how much'. The discovery that particles are discrete packets of energy with wave like properties led to the branch of Physics called Quantum Mechanics.
Quantum Computing is the usage of quantum mechanics to process information.

Qubits

The main advantage of quantum computers is that they aren't limited to two states, since they use qubits (quantum bits) instead of bits.
Qubits represent atoms, ions, electrons or photons and their respective control devices that work together to act as computer memory. Because a quantum computer can hold multiple states simultaneously, it has the potential to be millions of times more powerful than today's supercomputers.
Qubits are very hard to manipulate; any disturbance causes them to fall out of their quantum state. This is called decoherence. The field of quantum error correction examines the different ways to avoid decoherence.

Superposition and Entanglement

Superposition is the feature that frees a quantum system from binary constraints. It is the ability of a quantum system to be in multiple states at the same time.
Entanglement is an extremely strong correlation that exists between quantum particles. It is so strong that two or more quantum particles can be linked perfectly, even if separated by great distances.
A classical computer works with ones and zeroes; a quantum computer has the advantage of using ones, zeroes, and superpositions of ones and zeroes. This is why a quantum computer can process a vast number of calculations simultaneously.

Error Correcting Codes

Quantum computers outperform classical computers on certain problems, but efforts to build them have been hampered by the fragility of qubits, which are easily disturbed by heat and electromagnetic radiation. Error correcting codes are used to detect and compensate for these disturbances.

Quantum Computer 

A computation device that makes use of quantum mechanical phenomena, such as superposition and entanglement, to perform operations on data is termed a quantum computer. Such a device is probabilistic rather than deterministic, so it returns multiple possible answers for a given problem, along with an indication of the computer's confidence in each.

Applications of Quantum Computers 

1. They are great for solving optimization problems. 
2. They could break many widely used encryption algorithms. 
3. Machine learning tasks such as NLP, image recognition, etc.

Quantum Programming Language(QPL) 

A Quantum Programming Language is used to write programs for a quantum computer. Quantum Computer Language (QCL) is one of the most advanced implemented QPLs.
The basic built-in quantum data type in Quantum Computer Language is qreg (quantum register), which can be interpreted as an array of qubits (quantum bits).

Interesting facts 

Quantum computers require extremely cold temperatures, as sub-atomic particles must be as close as possible to a stationary state to be measured. The cores of D-Wave quantum computers operate at approximately -460 degrees F (-273 degrees C), only about 0.02 degrees above absolute zero.

Google's Quantum Supremacy

Quantum Supremacy, or Quantum Eclipse, is a demonstration that a quantum computer can perform a calculation that is practically impossible for a classical one.
Google's quantum computer, called 'Sycamore', consists of just 54 qubits. It was able to complete a random circuit sampling task. Google says Sycamore found the answer in just a few minutes, whereas it would take an estimated 10,000 years on the most powerful supercomputer.

Amazon Braket

AWS announced its new quantum computing service, Amazon Braket, along with the AWS Center for Quantum Computing and the Amazon Quantum Solutions Lab, at AWS re:Invent on 2 December 2019. Amazon Braket is a fully managed service that helps customers get started with quantum computing by providing a development environment to explore and design quantum algorithms, test them on simulated quantum computers, and run them on a choice of quantum hardware.

References

Information collected from various sources on the Internet.


At times an impulse is required to trigger an action. Special Thanks to Amazon Braket for being an irresistible impulse to make me write this blog on Quantum Computing. 






Make every day our masterpiece, 
Happy 2020!!


Saturday, 27 July 2019

Exploring Cassandra: Part- 1

Cassandra is an open source, column-family NoSQL database that is scalable to handle massive volumes of data stored across commodity nodes. 

Why Cassandra?

Consider a scenario where we need to store large amounts of log data. Millions of log entries will be written every day. It also requires a server with zero downtime.

Challenges with RDBMS


  • Cannot efficiently handle huge volumes of data
  • Difficult to serve users worldwide with the centralized single-node model
  • Hard to guarantee a server with zero downtime

Using Cassandra


  • It is highly scalable and hence can handle large amounts of data
  • Most appropriate for write-heavy workloads
  • Can handle millions of user requests per day
  • Can continue working even when nodes are down
  • Supports wide rows with a very flexible schema wherein all rows need not have the same number of columns

Cassandra Vs RDBMS



Cassandra Architecture


  • Cassandra follows a peer to peer master less architecture. So all the nodes in the cluster are considered equal.
  • Data is replicated on multiple nodes to ensure fault tolerance and high availability.
  • The node that receives a client request is called the coordinator. The coordinator forwards the request to the appropriate nodes responsible for the given row key.
  • Data Center: Collection of related nodes
  • Node: Place where the data is stored
  • Cluster: It contains one or more nodes

Applications of Cassandra


  • Suitable for high velocity data from sensors
  • Useful to store time series data (see the sketch after this list)
  • Social media networking sites use Cassandra for analysis and recommendation of products to their customers
  • Preferred by companies providing messaging services for managing massive amounts of data
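
To make the time-series use case concrete, the sketch below models sensor log data as wide rows, using the DataStax Java driver from Scala. The keyspace, table, contact point, and data-center name are hypothetical, and the SimpleStrategy / replication_factor = 1 setting is only suitable for a single-node test cluster.

import java.net.InetSocketAddress
import com.datastax.oss.driver.api.core.CqlSession

object SensorLogExample {
  def main(args: Array[String]): Unit = {
    val session = CqlSession.builder()
      .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
      .withLocalDatacenter("datacenter1")   // must match the cluster's data-center name
      .build()

    // Wide-row time-series model: one partition per sensor per day,
    // rows within the partition ordered by event time
    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS logs
        |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)
    session.execute(
      """CREATE TABLE IF NOT EXISTS logs.sensor_readings (
        |  sensor_id text, day date, event_time timestamp, value double,
        |  PRIMARY KEY ((sensor_id, day), event_time)
        |) WITH CLUSTERING ORDER BY (event_time DESC)""".stripMargin)

    // Insert one reading for a hypothetical sensor
    session.execute(
      "INSERT INTO logs.sensor_readings (sensor_id, day, event_time, value) " +
        "VALUES ('s-001', '2019-07-27', toTimestamp(now()), 21.5)")

    session.close()
  }
}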



All good things are difficult to achieve; and bad things are very easy to get.

Sunday, 26 May 2019

Google Cloud Platform(GCP) : Part- 3

Interacting with GCP

There are four ways we can interact with Google Cloud Platform.

  • Console: The GCP Console is a web-based administrative interface. It lets us view and manage all our projects and all the resources they use. It also lets us enable, disable, and explore the APIs of GCP services, and it gives us access to Cloud Shell. The GCP Console also includes a tool called the APIs Explorer that helps us learn about the APIs interactively. It lets us see what APIs are available and in what versions, and when an API expects parameters, documentation on them is built in.
  • SDK and Cloud Shell: Cloud Shell is a command-line interface to GCP that's easily accessed from the browser. From Cloud Shell, we can use the tools provided by the Google Cloud Software Development Kit (SDK) without having to first install them somewhere. The Google Cloud SDK is a set of tools that we can use to manage our resources and applications on GCP. These include the gcloud tool, which provides the main command-line interface for Google Cloud Platform products and services. There's also gsutil, which is for Google Cloud Storage, and bq, which is for BigQuery. The easiest way to get to the SDK commands is to click the Cloud Shell button in the GCP Console. We can also install the SDK on our own computers, our on-premises servers or virtual machines, and other clouds. The SDK is also available as a Docker image.
  • Mobile App
  • APIs: The services that make up GCP offer Restful application programming interfaces so that the code we write can control them. The GCP Console lets us turn on and off APIs. Many APIs are off by default, and many are associated with quotas and limits. These restrictions help protect us from using resources inadvertently. We can enable only those APIs we need and we can request increases in quotas when we need more resources.

Cloud Launcher

Google Cloud Launcher is a tool for quickly deploying functional software packages on Google Cloud platform. GCP updates the base images for the software packages to fix critical issues and vulnerabilities, but it doesn't update the software after it's been deployed.

Virtual Private Cloud(VPC)

Virtual machines have the power and generality of a full-fledged operating system in each instance. We can segment our networks, use firewall rules to restrict access to instances, and create static routes to forward traffic to specific destinations. The Virtual Private Cloud networks that we define have global scope. We can dynamically increase the size of a subnet in a custom network by expanding the range of IP addresses allocated to it. 
Features of VPCs are:

  • VPCs have routing tables. These are used to forward traffic from one instance to another within the same network, even across sub-networks and even between GCP zones, without requiring an external IP address.
  • VPCs give us a global distributed firewall. We can define rules to restrict access to instances, applying to both incoming and outgoing traffic.
  • Cloud Load Balancing is a fully distributed software defined managed service for all our traffic. With Cloud Load Balancing, a single anycast IP front ends all our backend instances in regions around the world. It provides cross region load balancing, including automatic multi region failover, which gently moves traffic in fractions if backends become unhealthy. Cloud Load Balancing reacts quickly to changes in users, traffic, backend health, network conditions, and other related conditions.
  • Cloud DNS is a managed DNS service running on the same infrastructure as Google. It has low latency and high availability, and it's a cost-effective way to make our applications and services available to our users. The DNS information we publish is served from redundant locations around the world. Cloud DNS is also programmable: we can publish and manage millions of DNS zones and records using the GCP Console, the command-line interface, or the API. Google also has a global system of edge caches. We can use this system to accelerate content delivery in our applications using Google Cloud CDN.
  • Cloud router lets our other networks and our Google VPC exchange route information over the VPN using the Border Gateway Protocol.
  • Peering means putting a router in the same public data center as a Google point of presence and exchanging traffic. One downside of peering, though, is that it isn't covered by a Google service level agreement. Customers who want the highest uptimes for their interconnection with Google should use Dedicated Interconnect, in which customers get one or more direct, private connections to Google. If these connections have topologies that meet Google's specifications, they can be covered by an SLA of up to 99.99 percent.

Compute Engine

Compute Engine lets us create and run virtual machines on Google infrastructure. We can create a virtual machine instance by using the Google Cloud Platform Console or the gcloud command-line tool. Once our VMs are running, it's easy to take a durable snapshot of their disks. We can keep these as backups or use them when we need to migrate a VM to another region. A preemptible VM is different from an ordinary Compute Engine VM in only one respect: we've given Compute Engine permission to terminate it if its resources are needed elsewhere. We can save a lot of money with preemptible VMs. Compute Engine also has a feature called autoscaling that lets us add and take away VMs from our application based on load metrics.

Don't limit your challenges. Challenge your limits..

Sunday, 14 April 2019

Hadoop : Part - 5


Security in Hadoop

Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service when using Kerberos.

  • Authentication – The client authenticates itself to the authentication server. Then, receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server.

Concurrent writes in HDFS

Multiple clients cannot write to an HDFS file at the same time. Apache Hadoop HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease. Now suppose some other client wants to write to that file: it asks the NameNode for the write operation, and the NameNode first checks whether it has already granted the lease for writing to that file to someone else. If someone else already holds the lease, the write request of the other client is rejected.

fsck

fsck is the file system check. HDFS uses the fsck (filesystem check) command to check for various inconsistencies. It also reports problems such as missing blocks for a file or under-replicated blocks. The NameNode automatically corrects most of the recoverable failures. The filesystem check can run on the whole file system or on a subset of files.

Datanode failures

The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.

Taskinstances

Task instances are the actual MapReduce tasks that run on each slave node. Each task instance runs in its own JVM process, and there can be multiple task instance processes running on a slave node.

Communication to HDFS

Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.
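
A minimal sketch of this flow through the Hadoop FileSystem API, written in Scala; the NameNode address and file path below are placeholders for your own cluster.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Placeholder NameNode address; normally picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://localhost:9000")

    // FileSystem.get talks to the NameNode to resolve metadata;
    // the actual block reads then go directly to the DataNodes
    val fs = FileSystem.get(conf)

    val file = new Path("/user/demo/sample.txt")
    val in = fs.open(file)
    try {
      // Read and print the first line of the file
      val reader = new java.io.BufferedReader(new java.io.InputStreamReader(in))
      println(reader.readLine())
    } finally {
      in.close()
    }
    fs.close()
  }
}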

HDFS block and Inputsplit

Block is the physical representation of data while split is the logical representation of data present in the block.

Hadoop federation

HDFS Federation enhances the existing HDFS architecture. It uses many independent NameNodes/namespaces to scale the name service horizontally, and it separates the namespace layer from the storage layer. Hence HDFS Federation provides isolation, scalability, and a simple design.



Don't Give Up. The beginning is always the hardest but life rewards those who work hard for it.

Saturday, 6 April 2019

Hadoop : Part - 4


Speculative Execution

Instead of identifying and fixing slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is called Speculative Execution.

Heartbeat in HDFS

A heartbeat is a signal indicating that a node is alive. Each DataNode periodically sends a heartbeat to the NameNode.

Hadoop archives

Hadoop Archives (HAR) offer an effective way to deal with the small files problem. HAR is an archiving facility that packs files into HDFS blocks efficiently, and hence it can be used to tackle the small files problem in Hadoop:
hadoop archive -archiveName myhar.har /input/location /output/location
Once a .har file is created, you can do a listing on it and you will see that it is made up of index files and part files. Part files are the original files concatenated together into a big file, while index files are lookup files used to look up the individual small files inside the big part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-000000

Reason for setting HDFS blocksize as 128MB

The block size is the smallest unit of data that a file system can store. If the block size is small, locating a file requires many lookups on the NameNode. HDFS is meant to handle large files; with a 128 MB block size, the number of blocks per file (and hence the number of requests) goes down, greatly reducing the metadata overhead and the load on the NameNode.

Data Locality in Hadoop

Data locality refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.

Safemode in Hadoop

Safemode in Apache Hadoop is a maintenance state of the NameNode, during which the NameNode doesn’t allow any modifications to the file system. In Safemode, the HDFS cluster is read-only and doesn’t replicate or delete blocks.

Single Point of Failure

In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read or write files.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes: if the active NameNode fails, a Standby NameNode takes over all the responsibilities of the active node.
Some deployments require a higher degree of fault tolerance, so Hadoop 3.0 extends this feature by allowing the user to run multiple Standby NameNodes.



Strive for excellence and success will follow you.. 


Wednesday, 27 March 2019

Hadoop : Part - 3


Checkpoint node

The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally.

Backup node

It maintains an up-to-date, in-memory copy of the file system namespace that is in sync with the active NameNode.

Overwriting replication factor in HDFS

The replication factor in HDFS can be modified or overwritten in two ways (a programmatic alternative is sketched after this list):
  • Using the Hadoop FS shell, the replication factor can be changed on a per-file basis with the command: $ hadoop fs -setrep -w 2 /my/test_file (test_file is the file whose replication factor will be set to 2)
  • Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified with the command: $ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the directory; all the files in it will have their replication factor set to 5)
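
The same change can also be made programmatically through the Hadoop FileSystem API. A minimal Scala sketch is shown below; the file path and target replication factor are just examples.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SetReplicationExample {
  def main(args: Array[String]): Unit = {
    // Picks up cluster settings (fs.defaultFS etc.) from the classpath configuration
    val fs = FileSystem.get(new Configuration())
    // Set the replication factor of an existing file to 2
    val changed = fs.setReplication(new Path("/my/test_file"), 2.toShort)
    println(s"Replication change accepted: $changed")
    fs.close()
  }
}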

Edge nodes

Edge nodes or gateway nodes are the interface between the Hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications.

InputFormats in Hadoop

  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat

Rack

A rack is a collection of around 40-50 machines. All these machines are connected using the same network switch, and if that switch goes down, all the machines in that rack will be out of service; we then say the rack is down.

Rack awareness

The physical location of the data nodes is referred to as Rack in HDFS. The rack id of each data node is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness.

Replica Placement Policy

The contents of a file are divided into data blocks. After consulting the NameNode, the client allocates 3 DataNodes for each data block. For each data block, two copies are kept in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.


In the middle of difficulty lies opportunity.. 

Sunday, 24 March 2019

Google Cloud Platform(GCP) : Part- 2

Multi layered security approach

Google also designs custom chips, including a hardware security chip called Titan that's currently being deployed on both servers and peripherals. Google server machines use cryptographic signatures to make sure they are booting the correct software. Google designs and builds its own data centers, which incorporate multiple layers of physical security protections.
Google's infrastructure provides cryptographic privacy and integrity for remote procedure call (RPC) data on the network, which is how Google services communicate with each other. The infrastructure automatically encrypts RPC traffic in transit between data centers. Google's central identity service, which usually manifests to end users as the Google login page, goes beyond asking for a simple username and password. It also intelligently challenges users for additional information based on risk factors, such as whether they have logged in from the same device or a similar location in the past. Users can also use second factors when signing in, including devices based on the Universal 2nd Factor (U2F) open standard.
Google services that want to make themselves available on the Internet register themselves with an infrastructure service called the Google Front End (GFE), which checks incoming network connections for correct certificates and best practices. The GFE additionally applies protections against denial-of-service attacks. The scale of its infrastructure enables Google to simply absorb many denial-of-service attacks, even behind the GFEs. Google also has multi-tier, multi-layer denial-of-service protections that further reduce the risk of any denial-of-service impact. Inside Google's infrastructure, machine intelligence and rules warn of possible incidents, and Google conducts Red Team exercises (simulated attacks) to improve the effectiveness of its responses.
The principle of least privilege says that each user should have only those privileges needed to do their job. In a least-privilege environment, people are protected from an entire class of errors.
GCP customers use IAM (Identity and Access Management) to implement least privilege, and it makes everybody happier. There are four ways to interact with GCP's management layer:

  • Web-based console
  • SDK and its command-line tools
  • APIs
  • Mobile app

GCP Resource Hierarchy

All the resources we use, whether they are virtual machines, Cloud Storage buckets, BigQuery tables, or anything else in GCP, are organized into projects. Optionally, these projects may be organized into folders, and folders can contain other folders. All the folders and projects used by our organization can be brought together under an organization node. Projects, folders, and organization nodes are all places where policies can be defined.
All Google Cloud Platform resources belong to a project. Projects are the basis for enabling and using GCP services, like managing APIs, enabling billing, adding and removing collaborators, and enabling other Google services. Each project is a separate compartment and each resource belongs to exactly one. Projects can have different owners and users; they are billed separately and managed separately. Each GCP project has a name and a project ID that we assign. The project ID is a permanent, unchangeable identifier, and it has to be unique across GCP. We use project IDs in several contexts to tell GCP which project we want to work with. On the other hand, project names are for our convenience, and we can assign them freely. GCP also assigns each of our projects a unique project number.
Folders let teams delegate administrative rights, so they can work independently. The resources in a folder inherit IAM policies from the folder. The organization node is the top of the resource hierarchy, and there are some special roles associated with it.

Identity and Access Management(IAM)

IAM lets administrators authorize who can take action on specific resources. An IAM policy has:
  • a "who" part
  • a "can do what" part
  • an "on which resource" part
The "who" part names the user or users. The "who" part of an IAM policy can be defined either by a Google account, a Google group, a service account, an entire G Suite domain, or a Cloud Identity domain. The "can do what" part is defined by an IAM role. An IAM role is a collection of permissions.
There are three kinds of roles in Cloud IAM. Primitive roles can be applied to a GCP project, and they affect all resources in that project. These are the owner, editor, and viewer roles. A viewer can examine a given resource but not change its state. An editor can do everything a viewer can do, plus change its state. An owner can do everything an editor can do, plus manage roles and permissions on the resource. The owner role can also set up billing. Often, companies want someone to be able to control the billing for a project without the right to change the resources in the project, and that's why we can grant someone the billing administrator role.

IAM Roles

The InstanceAdmin role lets whoever has that role perform a certain set of actions on virtual machines: listing Compute Engine instances, reading and changing their configurations, and starting and stopping them. We must manage permissions for custom roles ourselves, so some companies decide they'd rather stick with the predefined roles. Custom roles can only be used at the project or organization levels; they can't be used at the folder level. Service accounts are named with an email address, but instead of passwords they use cryptographic keys to access resources.


Be that one you always wanted to be..