Sunday 5 August 2018

Data Analysis with Python

Python is an open-source, object-oriented programming language that Guido van Rossum began developing in the late 1980s; the first public release followed in 1991. While implementing Python, van Rossum was also reading the published scripts from “Monty Python's Flying Circus”, a BBC comedy series. He wanted a name that was short and slightly mysterious, so he decided to call the language Python.

Python is an interpreted language. A compiled language is translated ahead of time into executable machine code for the CPU. An interpreted language is instead translated at run time: the standard Python interpreter compiles source code to bytecode and then executes that bytecode on a virtual machine.

At Google, Python is one of the three "official languages", alongside C++ and Java. Google even has a developer portal devoted to Python, with free classes that include exercises and lecture videos (https://developers.google.com/edu/python/). Python is also used as the configuration language for Tupperware, Facebook's container deployment system.

Installation

There are two options:

  • Directly download and install Python
  • Download and install Anaconda

Development environment


  • Terminal or shell
  • IDLE (Integrated Development and Learning Environment)
  • IPython Notebook (now known as Jupyter Notebook)

Data structures


  • Lists: Comma-separated values in square brackets. They are mutable, and the items can be of different types.
  • Strings: They are immutable. Enclosed between single ('), double (") or triple (''' or """) quotes. Strings enclosed in triple quotes can span multiple lines.
  • Tuples: A number of values separated by commas, usually surrounded by parentheses. They are immutable. Tuples can be slightly faster to process than lists because of their immutable nature.
  • Dictionary: Set of Key:Value pairs enclosed in curly braces. Keys are unique.
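
The four structures above can be tried out in a few lines. This is only a minimal sketch; the variable names and sample values are made up for illustration.

```python
# List: ordered, mutable, and free to mix item types
items = [1, "two", 3.0]
items.append(4)

# String: immutable, so methods return a new string
name = "python"
title = name.capitalize()
multi_line = """spans
two lines"""

# Tuple: immutable sequence that can be unpacked into names
point = (3, 4)
x, y = point
try:
    point[0] = 10  # item assignment is not allowed on a tuple
except TypeError:
    tuple_is_immutable = True

# Dictionary: Key:Value pairs in curly braces, keys are unique
ages = {"alice": 30, "bob": 25}
ages["alice"] = 31  # assigning to an existing key overwrites its value
```

Assigning to an existing dictionary key replaces the old value instead of adding a second entry, which is what "keys are unique" means in practice.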

Python libraries for data analysis


  • NumPy: Numerical Python. It provides the n-dimensional array feature. It also includes basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with low-level languages like C, C++ and Fortran.
  • SciPy: Scientific Python. It is a library for scientific computing built on NumPy, with modules for discrete Fourier transforms, linear algebra, optimization and sparse matrices.
  • Matplotlib: It is used for plotting graphs.
  • Pandas: It is used for structured data operations and manipulations. Pandas includes the following data structures:
    - Series: One-dimensional labelled arrays.
    - DataFrames: Two-dimensional data structures with column names and row indices.
  • Scikit-learn: It is a library for Machine Learning built on NumPy, SciPy and Matplotlib. It includes tools for Classification, Regression, Clustering and Dimensionality Reduction.
  • NLTK: Natural Language Toolkit. It is a library for Natural Language Processing.
  • Statsmodels: It is a library used for statistical modeling.
  • Seaborn: It is a library based on Matplotlib for statistical data visualization.
  • Bokeh: It is used for creating interactive plots and dashboards in web browsers. It can visualize large and streaming datasets.
  • Blaze: It extends NumPy and Pandas to streaming and distributed datasets. It has connectors to Apache Spark, MongoDB, etc.
  • Scrapy: It is used for web crawling. It can be used to extract information from all pages of a website.
  • Sympy: It is a library for symbolic computation. It has the capability of formatting the result of the computations as LaTeX code.
  • Astropy: It is a package for Astronomy in Python.
  • Biopython: It is a set of tools for biological computation. 
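
To make the NumPy and Pandas entries above more concrete, here is a small sketch of an n-dimensional array, a Series and a DataFrame. It assumes both libraries are installed (Anaconda ships them), and the names and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: a two-dimensional array with basic linear algebra
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
column_means = matrix.mean(axis=0)        # mean of each column
product = matrix @ np.array([1.0, 1.0])   # matrix-vector product

# Pandas Series: a one-dimensional labelled array
scores = pd.Series([90, 85], index=["alice", "bob"])

# Pandas DataFrame: two-dimensional, with column names and row indices
frame = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})
mean_score = frame["score"].mean()
```

Selecting a single column of a DataFrame, as in frame["score"], gives back a Series, so the two structures are meant to be used together.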

