Saturday, 11 January 2020

Quantum Computing

Classical computers are composed of registers and memory. The collective contents of these are often referred to as state. Instructions for a classical computer act on this state, which consists of a long string of bits, each encoding either a zero or a one.
Classical hardware is approaching physical limits. Moore's law observes that the number of transistors on a microprocessor doubles roughly every two years (a figure often quoted as 18 months), so transistors are shrinking toward atomic scale. A logical next step is to build quantum computers, which harness the behaviour of atoms and molecules to perform processing tasks.
The word quantum comes from the Latin quantus, meaning 'how great' or 'how much'. The discovery that energy comes in discrete packets and that particles have wave-like properties led to the branch of physics called quantum mechanics.
Quantum computing is the use of quantum mechanics to process information.

Qubits

The main advantage of quantum computers is that they use qubits (quantum bits) instead of bits, and a qubit is not limited to being in exactly one of two states.
Qubits are realized with atoms, ions, electrons or photons and their respective control devices, working together to act as computer memory. Because a quantum computer can hold multiple states simultaneously, it has the potential to be far more powerful than supercomputers on certain problems.
Qubits are very hard to manipulate: any disturbance causes them to fall out of their quantum state. This is called decoherence. The field of quantum error correction examines ways to detect and correct the errors that decoherence causes.

Superposition and Entanglement

Superposition is the feature that frees a qubit from binary constraints. It is the ability of a quantum system to be in multiple states at the same time.
Entanglement is an extremely strong correlation that exists between quantum particles. It is so strong that the measurement results of two or more entangled particles remain perfectly linked, even if the particles are separated by great distances.
A classical computer works with ones and zeroes; a quantum computer has the advantage of using ones, zeroes and superpositions of ones and zeroes. A register of n qubits can hold a superposition of all 2^n bit patterns, which is why certain quantum algorithms can work on a vast number of possibilities at once, although a measurement returns only one outcome.
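A minimal sketch of these two ideas, using plain Python lists as state vectors. The function names and the ordering of amplitudes are illustrative assumptions, not part of any quantum SDK.

```python
import math

def probabilities(state):
    """Return the measurement probability |amplitude|^2 of each basis state."""
    return [abs(a) ** 2 for a in state]

def hadamard(state):
    """Apply a Hadamard gate to a single-qubit state [amp0, amp1]."""
    s = 1 / math.sqrt(2)
    a0, a1 = state
    return [s * (a0 + a1), s * (a0 - a1)]

# A qubit that starts in |0> is put into an equal superposition of 0 and 1.
plus = hadamard([1.0, 0.0])

# A two-qubit Bell state, amplitudes ordered as |00>, |01>, |10>, |11>.
# Only 00 and 11 can be observed, so the two measurement results always agree.
bell = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
```

Calling probabilities(plus) gives an even split between 0 and 1, while probabilities(bell) puts all the weight on the outcomes 00 and 11, which is the correlation that entanglement describes.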

Error Correcting Codes

Quantum computers outperform classical computers on certain problems, but efforts to build them have been hampered by the fragility of qubits, which are easily disturbed by heat and electromagnetic radiation. Error-correcting codes address this by spreading the information of one logical qubit across several physical qubits so that errors can be detected and corrected.
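The redundancy idea can be sketched with a classical three-bit repetition code. This is only an analogy: real quantum codes cannot simply copy a qubit and rely on syndrome measurements instead, so the functions below are illustrative assumptions rather than a quantum error-correcting code.

```python
def encode(bit):
    """Store one logical bit as three physical copies."""
    return [bit, bit, bit]

def decode(bits):
    """Recover the logical bit by majority vote."""
    return 1 if sum(bits) >= 2 else 0

codeword = encode(1)
codeword[0] ^= 1              # a single bit flip, standing in for noise
recovered = decode(codeword)  # the majority vote still yields the original bit
```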

Quantum Computer 

A computation device that makes use of quantum mechanical phenomena, such as superposition and entanglement, to perform operations on data is termed a quantum computer. Such a device is probabilistic rather than deterministic. Running the same computation repeatedly can therefore return different answers, and the distribution of those answers indicates how much confidence to place in each one.

Applications of Quantum Computers 

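1. They are well suited to certain optimization problems.
2. They could break widely used public-key encryption schemes such as RSA, provided a sufficiently large and reliable machine is built.
3. They may speed up machine learning tasks such as NLP and image recognition.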

Quantum Programming Language(QPL) 

A quantum programming language is used to write programs for a quantum computer. Quantum Computation Language (QCL) is one of the first implemented QPLs.
The basic built-in quantum data type in QCL is qureg (quantum register), which can be interpreted as an array of qubits (quantum bits).

Interesting facts 

Quantum computers require extremely cold temperatures, because thermal energy disturbs the fragile quantum states of the qubits. The cores of D-Wave quantum computers operate at about -459.6 degrees F or -273.1 degrees C, which is roughly 0.02 degrees above absolute zero.
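The temperature figures can be checked with a short conversion. The sketch assumes that "0.02 degrees away from absolute zero" means 0.02 kelvin; that reading is an assumption, not something the sources state.

```python
def kelvin_to_celsius(kelvin):
    return kelvin - 273.15

def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

operating_kelvin = 0.02  # assumed reading of the "0.02 degrees" figure
celsius = kelvin_to_celsius(operating_kelvin)   # about -273.13
fahrenheit = celsius_to_fahrenheit(celsius)     # about -459.63
```

Absolute zero itself is -273.15 degrees C or -459.67 degrees F, so the rounded values -273 and -460 hide the small gap.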

Google's Quantum Supremacy

Quantum Supremacy or Quantum Eclipse is the demonstration that a quantum computer can perform a calculation that is practically impossible for a classical one.
Google's quantum computer, called 'Sycamore', consists of only 54 qubits. It completed a task known as random circuit sampling. Google says Sycamore found the answer in just a few minutes, whereas it would take 10,000 years on the most powerful supercomputer.

Amazon Braket

AWS announced a new quantum computing service, Amazon Braket, along with the AWS Center for Quantum Computing and the Amazon Quantum Solutions Lab, at AWS re:Invent on 2 December 2019. Amazon Braket is a fully managed service that helps us get started with quantum computing by providing a development environment to explore and design quantum algorithms, test them on simulated quantum computers, and run them on quantum hardware.

References

Information collected from various sources on the Internet.


At times an impulse is required to trigger an action. Special Thanks to Amazon Braket for being an irresistible impulse to make me write this blog on Quantum Computing. 






Make every day our masterpiece,
Happy 2020!!


Saturday, 27 July 2019

Exploring Cassandra: Part- 1

Cassandra is an open source, column-family NoSQL database that scales to handle massive volumes of data stored across commodity nodes.

Why Cassandra?

Consider a scenario where we need to store large amounts of log data. Millions of log entries will be written every day, and the service must run with zero downtime.

Challenges with RDBMS


  • Cannot efficiently handle huge volumes of data
  • Difficult to serve users worldwide with the centralized single-node model
  • Hard to provide a server with zero downtime

Using Cassandra


  • It is highly scalable and hence can handle large amounts of data
  • Most appropriate for write-heavy workloads
  • Can handle millions of user requests per day
  • Can continue working even when nodes are down
  • Supports wide rows with a very flexible schema wherein all rows need not have the same number of columns

Cassandra Vs RDBMS



Cassandra Architecture


  • Cassandra follows a peer-to-peer, masterless architecture, so all the nodes in the cluster are considered equal.
  • Data is replicated on multiple nodes to ensure fault tolerance and high availability.
  • The node that receives a client request is called the coordinator. The coordinator forwards the request to the appropriate node responsible for the given row key.
  • Node: Place where the data is stored
  • Data Center: Collection of related nodes
  • Cluster: Contains one or more data centers
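How a coordinator might find the nodes responsible for a row key can be sketched with a simple hash ring. The node names, the row key, the use of MD5 and the one-token-per-node layout are all assumptions made for illustration; Cassandra's own partitioner and replication strategies are more involved.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns a single token here; real clusters use many per node.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key, replication_factor=3):
        """Walk clockwise from the key's token and collect the next nodes."""
        start = bisect.bisect_right(self.tokens, token(row_key))
        count = min(replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])  # hypothetical node names
owners = ring.replicas("log-2019-07-27")               # hypothetical row key
```

Because every node can compute the same ring, any node can act as coordinator and forward the request to the owners without consulting a master.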

Applications of Cassandra


  • Suitable for high velocity data from sensors
  • Useful to store time series data
  • Social media networking sites use Cassandra for analysis and recommendation of products to their customers
  • Preferred by companies providing messaging services for managing massive amounts of data



All good things are difficult to achieve; and bad things are very easy to get.

Sunday, 26 May 2019

Google Cloud Platform(GCP) : Part- 3

Interacting with GCP

There are four ways we can interact with Google Cloud Platform.

  • Console: The GCP Console is a web-based administrative interface. It lets us view and manage all the projects and all the resources they use. It also lets us enable, disable and explore the APIs of GCP services, and it gives us access to Cloud Shell. The GCP Console also includes a tool called the APIs Explorer that helps us learn about the APIs interactively. It lets us see what APIs are available and in what versions, which parameters they expect, and their built-in documentation.
  • SDK and Cloud Shell: Cloud Shell is a command-line interface to GCP that's easily accessed from the browser. From Cloud Shell, we can use the tools provided by the Google Cloud Software Development Kit (SDK) without having to first install them somewhere. The Google Cloud SDK is a set of tools that we can use to manage our resources and applications on GCP. These include the gcloud tool, which provides the main command-line interface for Google Cloud Platform products and services. There's also gsutil, which is for Google Cloud Storage, and bq, which is for BigQuery. The easiest way to get to the SDK commands is to click the Cloud Shell button in the GCP Console. We can also install the SDK on our own computers, on our on-premises servers, or on virtual machines in other clouds. The SDK is also available as a Docker image.
  • Mobile App
  • APIs: The services that make up GCP offer RESTful application programming interfaces so that the code we write can control them. The GCP Console lets us turn APIs on and off. Many APIs are off by default, and many are associated with quotas and limits. These restrictions help protect us from using resources inadvertently. We can enable only those APIs we need, and we can request increases in quotas when we need more resources.

Cloud Launcher

Google Cloud Launcher is a tool for quickly deploying functional software packages on Google Cloud platform. GCP updates the base images for the software packages to fix critical issues and vulnerabilities, but it doesn't update the software after it's been deployed.

Virtual Private Cloud(VPC)

Virtual machines have the power and generality of a full-fledged operating system in each. A Virtual Private Cloud network connects them: we can segment our networks, use firewall rules to restrict access to instances, and create static routes to forward traffic to specific destinations. The Virtual Private Cloud networks that we define have global scope. We can dynamically increase the size of a subnet in a custom network by expanding the range of IP addresses allocated to it.
Features of VPCs are:

  • VPCs have routing tables. These are used to forward traffic from one instance to another within the same network, even across subnetworks and even between GCP zones, without requiring an external IP address.
  • VPCs give us a globally distributed firewall. We can define rules that restrict both incoming and outgoing traffic to instances.
  • Cloud Load Balancing is a fully distributed software defined managed service for all our traffic. With Cloud Load Balancing, a single anycast IP front ends all our backend instances in regions around the world. It provides cross region load balancing, including automatic multi region failover, which gently moves traffic in fractions if backends become unhealthy. Cloud Load Balancing reacts quickly to changes in users, traffic, backend health, network conditions, and other related conditions.
  • Cloud DNS is a managed DNS service running on the same infrastructure as Google. It has low latency and high availability, and it's a cost-effective way to make our applications and services available to our users. The DNS information we publish is served from redundant locations around the world. Cloud DNS is also programmable. We can publish and manage millions of DNS zones and records using the GCP Console, the command-line interface or the API. Google also has a global system of edge caches. We can use this system to accelerate content delivery in our application using Google Cloud CDN.
  • Cloud Router lets our other networks and our Google VPC exchange route information over the VPN using the Border Gateway Protocol.
  • Peering means putting a router in the same public data center as a Google point of presence and exchanging traffic. One downside of peering, though, is that it isn't covered by a Google service level agreement. Customers who want the highest uptimes for their interconnection with Google should use Dedicated Interconnect, in which customers get one or more direct private connections to Google. If these connections have topologies that meet Google's specifications, they can be covered by up to a 99.99 percent SLA.

Compute Engine

Compute Engine lets us create and run virtual machines on Google infrastructure. We can create a virtual machine instance by using the Google Cloud Platform Console or the gcloud command-line tool. Once our VMs are running, it's easy to take a durable snapshot of their disks. We can keep these as backups or use them when we need to migrate a VM to another region. A preemptible VM is different from an ordinary Compute Engine VM in only one respect: we've given Compute Engine permission to terminate it if its resources are needed elsewhere. We can save a lot of money with preemptible VMs. Compute Engine has a feature called autoscaling that lets us add and take away VMs from our application based on load metrics.

Don't limit your challenges. Challenge your limits..

Sunday, 14 April 2019

Hadoop : Part - 5


Security in Hadoop

Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service when using Kerberos.

  • Authentication – The client authenticates itself to the authentication server and then receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server.

Concurrent writes in HDFS

Multiple clients cannot write to an HDFS file at the same time. HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease. If another client then asks the NameNode to write to the same file, the NameNode first checks whether it has already granted the lease for that file to someone else. If it has, it rejects the write request of the second client.
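The lease check can be sketched as a small in-memory table. The class name, method names, paths and client ids below are hypothetical, and the sketch ignores lease expiry and recovery, which the real NameNode also has to handle.

```python
class LeaseManager:
    """Toy model of a NameNode granting one write lease per file."""

    def __init__(self):
        self.leases = {}  # file path -> client currently holding the lease

    def request_write(self, path, client):
        holder = self.leases.get(path)
        if holder is not None and holder != client:
            return False  # another client is already writing this file
        self.leases[path] = client
        return True

    def release(self, path, client):
        # Only the current holder can give the lease back.
        if self.leases.get(path) == client:
            del self.leases[path]

manager = LeaseManager()
first = manager.request_write("/logs/app.log", "client-1")   # granted
second = manager.request_write("/logs/app.log", "client-2")  # rejected
manager.release("/logs/app.log", "client-1")
third = manager.request_write("/logs/app.log", "client-2")   # granted after release
```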

fsck

fsck stands for file system check. HDFS provides the fsck command to check for various inconsistencies. It reports problems such as missing blocks for a file or under-replicated blocks. The command itself only reports; the NameNode automatically corrects most of the recoverable failures. The check can run on the whole file system or on a subset of files.

Datanode failures

The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks are now under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
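The detection step can be sketched as follows. The timeout value, node ids and block ids are assumptions chosen for illustration, and timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Toy model of a NameNode tracking DataNode heartbeats."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

def under_replicated(block_locations, dead, target):
    """Return blocks whose live replica count has dropped below the target."""
    result = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < target:
            result[block] = live
    return result

monitor = HeartbeatMonitor(timeout_seconds=600)  # assumed timeout
for node in ("dn1", "dn2", "dn3"):
    monitor.heartbeat(node, now=0)
monitor.heartbeat("dn1", now=500)
monitor.heartbeat("dn2", now=500)

dead = monitor.dead_nodes(now=700)  # dn3 has been silent for too long
blocks = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn1", "dn2"]}
to_copy = under_replicated(blocks, dead, target=2)
```

Each entry in to_copy lists the surviving replicas of a block, which are the nodes a new copy could be made from.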

Task instances

Task instances are the actual MapReduce tasks that run on each slave node. Each task instance runs in its own JVM process. There can be multiple task instance processes running on a slave node.

Communication to HDFS

Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. Once the NameNode has provided the location of the data, client applications can talk directly to a DataNode.

HDFS block and Inputsplit

A block is the physical representation of data, while an input split is the logical representation of the data present in the blocks. A split is the portion of input that a single mapper processes.

Hadoop federation

HDFS Federation enhances an existing HDFS architecture. Hadoop Federation uses many independent Namenode/namespaces to scale the name service horizontally. It separates the namespace layer and the storage layer. Hence HDFS federation provides Isolation, Scalability and simple design.



Don't Give Up. The beginning is always the hardest but life rewards those who work hard for it.

Saturday, 6 April 2019

Hadoop : Part - 4


Speculative Execution

Instead of identifying and fixing slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches another equivalent task as a backup. This backup mechanism in Hadoop is called speculative execution.
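A rough sketch of the detection step, assuming a simple rule that flags any task whose progress is below half the average. The rule, the threshold and the task ids are assumptions made for illustration; Hadoop's own heuristic is more elaborate.

```python
def pick_speculative(progress, slowdown=0.5):
    """Return the tasks whose progress is below `slowdown` times the average."""
    average = sum(progress.values()) / len(progress)
    return sorted(t for t, p in progress.items() if p < slowdown * average)

# Fraction of work completed by each running task (hypothetical task ids).
progress = {"map-0": 0.9, "map-1": 0.85, "map-2": 0.2}
backups = pick_speculative(progress)  # tasks that would get a backup attempt
```

Whichever attempt of a task finishes first is kept and the other is discarded.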

Heartbeat in HDFS

A heartbeat is a signal that a node sends to indicate that it is alive. A DataNode sends heartbeats to the NameNode.

Hadoop archives

Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks efficiently, which makes it an effective way to deal with the small files problem in Hadoop.
hadoop archive -archiveName myhar.har /input/location /output/location
Once a .har file is created, you can list its contents and see that it is made up of index files and part files. Part files are the original files concatenated together into one big file. Index files are lookup files used to locate the individual small files inside the big part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-000000

Reason for setting HDFS blocksize as 128MB

The block size is the smallest unit of data that a file system can store. With a small block size, a large file is split into many blocks, so locating it requires many lookups on the NameNode. HDFS is meant to handle large files. With a block size of 128 MB, the number of blocks and requests goes down, greatly reducing the overhead and the load on the NameNode.
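The effect is easy to see with a little arithmetic. The 1 GB file and the 4 MB comparison size are arbitrary values chosen for illustration.

```python
def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed for a file, rounding up the last partial block."""
    return -(-file_size_bytes // block_size_bytes)

MB = 1024 * 1024
file_size = 1024 * MB                       # a 1 GB file

small = block_count(file_size, 4 * MB)      # 256 blocks to track
large = block_count(file_size, 128 * MB)    # 8 blocks to track
```

Every block is an entry the NameNode has to keep track of, so the larger block size cuts the bookkeeping for this file by a factor of 32.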

Data Locality in Hadoop

Data locality refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.

Safemode in Hadoop

Safemode in Apache Hadoop is a maintenance state of the NameNode during which it doesn't allow any modifications to the file system. During Safemode, the HDFS cluster is read-only and doesn't replicate or delete blocks.

Single Point of Failure

In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, clients are unable to read or write files.
Hadoop 2.0 overcomes this SPOF by supporting a standby NameNode. If the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node.
Some deployments require a higher degree of fault tolerance, so version 3.0 allows the user to run multiple standby NameNodes.



Strive for excellence and success will follow you.. 


Wednesday, 27 March 2019

Hadoop : Part - 3


Checkpoint node

The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. It creates checkpoints of the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally.
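The merge can be pictured as replaying a log of changes on top of a snapshot. The data structures below (a set of paths and a list of operations) are a deliberately simplified assumption; the real fsimage and edits files have their own binary formats and many more operation types.

```python
def apply_edits(fsimage, edits):
    """Replay an edit log on top of a namespace snapshot and return the new snapshot."""
    namespace = set(fsimage)  # copy, so the old checkpoint is left untouched
    for operation, path in edits:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
    return namespace

fsimage = {"/data/a.txt", "/data/b.txt"}                         # last checkpoint
edits = [("create", "/data/c.txt"), ("delete", "/data/a.txt")]   # changes since then
checkpoint = apply_edits(fsimage, edits)
```

Once a new checkpoint exists, the edits that were merged into it no longer need to be replayed at startup.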

Backup node

The Backup Node maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.

Overwriting replication factor in HDFS

The replication factor in HDFS can be modified or overwritten in 2 ways:
  • Using the Hadoop FS shell, the replication factor can be changed for a single file with the command below (test_file is the file whose replication factor will be set to 2): $ hadoop fs -setrep -w 2 /my/test_file
  • Using the Hadoop FS shell, the replication factor of all files under a given directory can be changed with the command below (test_dir is the directory, and all the files in it will have their replication factor set to 5): $ hadoop fs -setrep -w 5 /my/test_dir

Edge nodes

Edge nodes, or gateway nodes, are the interface between the Hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications.

InputFormats in Hadoop

  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat

Rack

A rack is a collection of around 40-50 machines connected to the same network switch. If that switch goes down, all the machines in the rack are out of service, and we say the rack is down.

Rack awareness

In HDFS, the rack identifies the physical location of a data node. The NameNode obtains the rack id of each data node. The process of selecting closer data nodes based on this rack information is known as Rack Awareness.

Replica Placement Policy

The contents of a file are divided into data blocks. For each data block, the client consults the NameNode, which allocates 3 data nodes to hold it. For each data block, 2 copies are stored in one rack and the third copy is stored in another rack. This is generally referred to as the Replica Placement Policy.
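A simplified sketch of such a placement, assuming the first replica goes on the writer's own node and the other two go on two nodes of a single remote rack. The function name, rack ids and node names are hypothetical, and the sketch skips checks such as free space and node load.

```python
def place_replicas(writer_node, topology):
    """Pick three nodes for a block; topology maps rack id -> list of node names."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Choose a different rack that has room for two replicas.
    remote_rack = next(
        r for r, nodes in topology.items() if r != local_rack and len(nodes) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}  # hypothetical cluster
targets = place_replicas("dn1", topology)
```

With this layout the block survives the loss of a whole rack, while two of the three copies share a rack and so cost less cross-rack traffic to write.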


In the middle of difficulty lies opportunity.. 

Sunday, 24 March 2019

Google Cloud Platform(GCP) : Part- 2

Multi layered security approach

Google also designs custom chips, including a hardware security chip called Titan that's currently being deployed on both servers and peripherals. Google server machines use cryptographic signatures to make sure they are booting the correct software. Google designs and builds its own data centers, which incorporate multiple layers of physical security protections.
Google's infrastructure provides cryptographic privacy and integrity for remote procedure call (RPC) data on the network, which is how Google services communicate with each other. The infrastructure automatically encrypts RPC traffic in transit between data centers. Google's central identity service, which usually manifests to end users as the Google login page, goes beyond asking for a simple username and password. It also intelligently challenges users for additional information based on risk factors, such as whether they have logged in from the same device or a similar location in the past. Users can also use second factors when signing in, including devices based on the Universal 2nd Factor (U2F) open standard.
Google services that want to make themselves available on the Internet register themselves with an infrastructure service called the Google Front End (GFE), which checks incoming network connections for correct certificates and best practices. The GFE additionally applies protections against denial of service attacks. The scale of its infrastructure enables Google to simply absorb many denial of service attacks, even behind the GFEs. Google also has multi-tier, multi-layer denial of service protections that further reduce the risk of any denial of service impact. Inside Google's infrastructure, machine intelligence and rules warn of possible incidents. Google conducts Red Team exercises, which are simulated attacks, to improve the effectiveness of its responses.
The principle of least privilege says that each user should have only those privileges needed to do their job. In a least privilege environment, people are protected from an entire class of errors.
GCP customers use IAM (Identity and Access Management) to implement least privilege, and it makes everybody happier. There are four ways to interact with GCP's management layer:

  • Web-based console
  • SDK and its command-line tools
  • APIs
  • Mobile app

GCP Resource Hierarchy

All the resources we use, whether they're virtual machines, Cloud Storage buckets, tables in BigQuery or anything else in GCP, are organized into projects. Optionally, these projects may be organized into folders. Folders can contain other folders. All the folders and projects used by our organization can be brought together under an organization node. Projects, folders and organization nodes are all places where policies can be defined.
All Google Cloud Platform resources belong to a project. Projects are the basis for enabling and using GCP services, such as managing APIs, enabling billing, adding and removing collaborators, and enabling other Google services. Each project is a separate compartment, and each resource belongs to exactly one. Projects can have different owners and users; they're billed separately and they're managed separately. Each GCP project has a name and a project ID that we assign. The project ID is a permanent, unchangeable identifier, and it has to be unique across GCP. We use project IDs in several contexts to tell GCP which project we want to work with. Project names, on the other hand, are for our convenience, and we can assign them. GCP also assigns each of our projects a unique project number.
Folders let teams delegate administrative rights, so they can work independently. The resources in a folder inherit IAM policies from the folder. The organization node is the top of the resource hierarchy. There are some special roles associated with it.

Identity and Access Management(IAM)

IAM lets administrators authorize who can take action on specific resources. An IAM policy has three parts:
  • A "who" part
  • A "can do what" part
  • An "on which resource" part
The "who" part names the user or users. It can be defined by a Google account, a Google group, a service account, or an entire G Suite or Cloud Identity domain. The "can do what" part is defined by an IAM role. An IAM role is a collection of permissions.
There are three kinds of roles in Cloud IAM: primitive, predefined and custom. Primitive roles can be applied to a GCP project, and they affect all resources in that project. These are the owner, editor and viewer roles. A viewer can examine a given resource but not change its state. An editor can do everything a viewer can do, plus change its state. An owner can do everything an editor can do, plus manage roles and permissions on the resource. The owner role can also set up billing. Often, companies want someone to be able to control the billing for a project without the right to change the resources in the project. That's why we can grant someone the billing administrator role.
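The three parts of a policy and the nesting of the primitive roles can be sketched as below. The permission names ("read", "write", "manage_roles"), the member addresses and the project names are simplified assumptions; real Cloud IAM permissions are much finer grained.

```python
# Each primitive role includes everything the role before it can do.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "owner": {"read", "write", "manage_roles"},
}

# Each binding answers: who, can do what (through a role), on which resource.
bindings = [
    ("alice@example.com", "owner", "project-a"),
    ("bob@example.com", "viewer", "project-a"),
]

def is_allowed(member, permission, resource):
    """Check whether any binding grants the permission on the resource."""
    return any(
        who == member and res == resource and permission in ROLE_PERMISSIONS[role]
        for who, role, res in bindings
    )
```

Here bob can read project-a but cannot change it, and neither member has any access to a project that is not named in a binding.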

IAM Roles

The InstanceAdmin role lets whoever has that role perform a certain set of actions on virtual machines. The actions are listing Compute Engine instances, reading and changing their configurations, and starting and stopping them. With custom roles, we must manage the permissions ourselves, so some companies decide they'd rather stick with the predefined roles. Custom roles can only be used at the project or organization levels. They can't be used at the folder level. Service accounts are named with an email address, but instead of passwords, they use cryptographic keys to access resources.


Be that one you always wanted to be..