Friday, 16 September 2016

Prisma & Machine Learning

Hello... You might be wondering how I switched topics from the Natural Language API to deep learning! Let me make that clear: whatever concept I read about each day becomes my thought for that day. That's it. Over the last couple of months I have noticed that many of my social media friends tag their images with Prisma. So I thought of investigating the internals of the app. Prisma is a Russian app that uses neural networks to turn images into paintings. It is similar to Google's Deep Dream image recognition software. When we upload an image to the app, the image is transferred to its servers in Moscow, where artificial intelligence and neural networks process it, and the result is returned to the phone.
Deep learning is a branch of machine learning. It consists of a set of algorithms that model high-level abstractions in data. Some of the deep learning architectures include deep neural networks, convolutional neural networks, deep belief networks and recurrent neural networks. Neural networks are used to perform tasks that are easy for humans but difficult for machines. They acquire knowledge by learning, and this knowledge is used to model outputs for future inputs.
The learning strategies can be divided into three categories:
  • Supervised learning: This involves providing a set of predefined inputs and outputs for learning. E.g. face recognition
  • Unsupervised learning: This is used when we don't have an example dataset with known answers. E.g. clustering
  • Reinforcement learning: This is a strategy built on observation. E.g. robotics
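As a toy illustration of the supervised strategy (a sketch of the general idea, not anything Prisma or Deep Dream actually uses), here is a single perceptron that learns the logical AND function from labelled examples; the learning rate and epoch count are arbitrary choices:

```java
// A minimal single-neuron perceptron trained on labelled examples (supervised learning).
public class Perceptron {
    private final double[] w;   // weights; the last entry is the bias
    private final double lr;    // learning rate

    public Perceptron(int inputs, double learningRate) {
        this.w = new double[inputs + 1];
        this.lr = learningRate;
    }

    public int predict(double[] x) {
        double sum = w[w.length - 1];                       // start from the bias
        for (int i = 0; i < x.length; i++) sum += w[i] * x[i];
        return sum > 0 ? 1 : 0;
    }

    // Repeated passes over the labelled training set, nudging weights on each error.
    public void train(double[][] xs, int[] ys, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < xs.length; i++) {
                int error = ys[i] - predict(xs[i]);
                for (int j = 0; j < xs[i].length; j++) w[j] += lr * error * xs[i][j];
                w[w.length - 1] += lr * error;              // bias update
            }
        }
    }

    public static void main(String[] args) {
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] ys = {0, 0, 0, 1};                            // labels for logical AND
        Perceptron p = new Perceptron(2, 1.0);
        p.train(xs, ys, 25);
        System.out.println(p.predict(new double[]{1, 1})); // prints 1: AND was learned
    }
}
```

Because the AND labels are linearly separable, the perceptron convergence theorem guarantees this training loop settles on correct weights.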
Neural networks are used in the following fields:
  • Pattern recognition: The most common example is facial recognition
  • Time series prediction: A popular example is predicting the ups and downs of stock markets
  • Control: This involves the design of self-driving cars
  • Signal processing: One of the best examples is the design of cochlear implants

Every time we are being redirected to something better!

Monday, 12 September 2016

Google Cloud Natural Language Processing API

A couple of months ago I got a mail from the Google Cloud team about their new product launch. Out of curiosity, I started Googling to find out what it was. So let me share my thoughts on it.
Google is consistently making advancements in machine learning. Last year it open-sourced its machine learning software library, TensorFlow. Earlier this year it introduced SyntaxNet, a neural-network natural language processing framework for TensorFlow. Now comes the Cloud Natural Language API.
This REST API reveals the structure and meaning of the text. Initially it supports the following Natural Language Processing tasks:
  • Entity Recognition: Identify the different entity types such as Person, Location, Organisation, Events etc. from the text.
  • Sentiment Analysis: Understand the overall sentiment of the given text.
  • Syntax Analysis: Identify Parts of Speech and create Dependency parse tree for the input sentence.
The primary languages supported by the API are English, Spanish and Japanese, and it has client libraries for Java, Python and Node.js. One of the major alpha customers for this API is Ocado Technology, the technology division of the popular British online supermarket Ocado.
If we want to stay on the Google stack for analytics, natural language processing can be done using the Google Cloud Natural Language API, the processing results can be stored in a BigQuery table (BigQuery is Google's RESTful web service for data storage and analysis), and the visualization can be done using Google Data Studio. Please note that Google Data Studio is currently available only in the U.S.
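For illustration, the request body for the API's documents:analyzeSentiment method can be built as follows. This is a sketch: the class name and sample text are mine, the naive string concatenation assumes the text contains no characters needing JSON escaping, and real code should use a JSON library and attach an API key or OAuth token.

```java
// Sketch of the JSON request body for the Cloud Natural Language API's
// documents:analyzeSentiment method. No network call is made here.
public class SentimentRequest {
    public static String requestBody(String text) {
        // Assumes `text` needs no JSON escaping; use a JSON library in real code.
        return "{\"document\": {\"type\": \"PLAIN_TEXT\", \"content\": \"" + text + "\"},"
             + " \"encodingType\": \"UTF8\"}";
    }

    public static void main(String[] args) {
        // This body would be POSTed to
        // https://language.googleapis.com/v1/documents:analyzeSentiment
        System.out.println(requestBody("The new API looks promising."));
    }
}
```

The response contains an overall sentiment score and magnitude for the document.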
Happy reading!


Dream up to the Stars so that you can end up in the Clouds.

Thursday, 1 September 2016

Selfie Mining

Nowadays it's common for personal stories to be told through social images. We might think the pictures we snap of ourselves and post on social media sites are just for our friends on those platforms. It's high time to correct this misbelief: only the data we mark as private is actually guarded by privacy laws; everything else is public. Marketers are grabbing our images for research, a process called selfie mining.
When we take a picture of ourselves, we do so without a specific product in mind. That is not the case with marketers. They might be interested in our clothing, the products we use, the emotions on our faces and so on. There are companies that mine selfies: they use APIs to access the images, and the most interesting aspect is that the owners are unaware of it. Intentionally or not, selfies promote whatever we are wearing, sitting near or using. Many digital marketing companies have built technology to scan and process photos to identify particular interests or hobbies, which in turn helps them better target advertisers.
Two of these companies are Ditto Labs and Piqora.
Ditto Labs: It scans photos on sites like Instagram to generate insights for customers. Ditto Labs places users into categories, such as “sports fans” and “foodies”, based on the context of their images. Advertisers such as Kraft Foods Group Inc. pay Ditto Labs to find their products’ logos in photos on social media. The following aspects are taken into consideration:

  • Products- Users who post images of food items and beverages are flagged for these interests.
  • Clothing- Ditto classifies objects. It also detects fabrics or patterns in clothing.
  • Faces- The emotions in the face help advertisers to understand sentiment.
  • Logos- Advertisers can search for photos featuring rival brands to lure those customers away.
  • Scenes- Analysing the background of images helps the advertisers to find where and how customers use their products.

Piqora: They store images for months on their own servers to show marketers what is trending in popularity. Piqora mainly analyses images in Pinterest. It was recently acquired by Olapic which analyses images on Instagram.
Well, this indicates that some of the best digital marketing trends are on the way. Let's hope that the best is yet to come in the near future.

Source:
http://programmaticadvertising.org/2014/10/20/selfie-mining-whats-really-going-on/
http://www.wsj.com/articles/smile-marketing-firms-are-mining-your-selfies-1412882222
We anyways have to think, why not think big?

Saturday, 23 April 2016

The Buzz word Big Data..

Hello friends,

Sorry for the late post. I was busy completing a significant milestone in my life. So let me start. As I promised, we will see some facts about Big Data. Today's world can be best described as "we are drowning in data but starving for information". The data from the web, social media sites, sensors, logs etc. is so large that we cannot handle it with traditional data processing methods. Hundreds of millions of status updates are posted to Facebook daily. We see large data, but have we thought about how to store it? It's high time to think of such data storage techniques. Here comes the importance of big data. Let me go back to my good old school days.

What is Big Data?

  According to Wikipedia, big data is a term used to describe large and complex data sets that cannot be handled by traditional data processing applications. It also covers the capture, storage, search, analysis and visualization of the data.

Why Big Data?

Organisations can gain significant advantages by managing data effectively. Some of them include:

  • Better decision making: With the speed of data processing frameworks like Hadoop, combined with in-memory analytics, we can merge data from multiple sources, which yields better decision making.
  • More business opportunities: We can mine the data to find customer needs and measure their satisfaction. This in turn helps us develop the products that customers want.
  • Cost reduction: The majority of big data processing tools are open source, which reduces the cost of analysing large data sets.

How is Big Data?

Big Data can be best described by its characteristics. They are:

  • Volume: It describes the quantity of the data
  • Velocity: It is the speed at which the data is generated
  • Variety: It is the nature of the data
  • Veracity: It describes the quality of the data
  • Variability: The inconsistency in the data set is described by variability

When is Big Data?

The data stored in traditional databases cannot be regarded as big data. When we are dealing with terabytes and petabytes of information coming from diverse sources within a short span of time, it can be regarded as big data. In other words, big data is data that satisfies the 5 V's, the characteristics of big data described above.

Too much of theory for the day. We can look at the tools and techniques used in Big Data Analytics in the next post. So just wait and see. Catch you all soon..



No Dream is too High!


Thursday, 7 January 2016

Tips and Tricks- 8

1. Tomcat server not starting within 45 seconds
Delete the server from eclipse and reconfigure it or add it again to Eclipse.


2. Accessing Apache tomcat 7 built in Host Manager GUI

Change the Tomcat\conf\tomcat-users.xml file as follows:
<role rolename="manager-gui"/>
<user username="admin" password="password" roles="manager-gui"/>

Start tomcat and access: http://localhost:8080/manager/html with the provided username and password


3. HTTP Status 405 – HTTP method GET is not supported by this URL
The reasons are:
1) You do not have a valid doGet() method; when you type the servlet’s path in the address bar directly, the web container (e.g. Tomcat) tries to invoke the doGet() method.
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    // handle GET requests here
}

2) You made an HTTP POST request from an HTML form, but you do not have a doPost() method to handle it. doGet() cannot handle a POST request.
@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    // handle POST requests here
}

4. Tomcat Server is not starting and there are no output logs.
Check whether the JAVA_HOME is properly set.


Every Time we are being redirected to something better!

Sunday, 20 December 2015

More about Weka

Getting Started

1 Weka Installation

Weka requires Java. We may already have Java installed; if not, there are versions of Weka listed on the download page that include it. The latest version of Weka can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/

2 User Interfaces

The GUI Chooser in Weka consists of four options:
• Explorer: It is an environment for exploring data with WEKA. 
• Experimenter: It is used for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow: It supports same functions as the Explorer but with a drag and drop interface. 
• Simple CLI: This is a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

3 Package Installation

Choose Tools -> Package Manager from the Weka GUI Chooser. The package manager’s window is split horizontally into two parts: at the top is a list of packages and at the bottom is a mini browser that can be used to display information on the currently selected package.

4 Data Sets

Each entry in a dataset is an instance of the java class: weka.core.Instance. 
4.1 ARFF
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. It consists of:
• a header section prefixed by @RELATION
• an @ATTRIBUTE line for each attribute
• a data section prefixed by @DATA
• comma-separated data rows, with the class as the last attribute
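The layout above can be sketched with a small helper that emits a minimal ARFF file. The ArffWriter class and the tiny weather dataset are mine, invented for illustration; it is not part of Weka.

```java
import java.util.List;

// Sketch: emit a minimal ARFF file (header, attributes, then data rows).
public class ArffWriter {
    public static String toArff(String relation, String[] attributes, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION ").append(relation).append("\n\n");
        for (String a : attributes) sb.append("@ATTRIBUTE ").append(a).append("\n");
        sb.append("\n@DATA\n");
        for (String[] row : rows) sb.append(String.join(",", row)).append("\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] attrs = {"outlook {sunny,rainy}", "play {yes,no}"};
        List<String[]> rows = List.of(new String[]{"sunny", "yes"},
                                      new String[]{"rainy", "no"});
        // The class attribute ("play") comes last, matching the data rows.
        System.out.print(toArff("weather", attrs, rows));
    }
}
```

The resulting text can be loaded straight into the Weka Explorer as a .arff file.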
4.2 XRFF
The XRFF (XML attribute Relation File Format) represents the data in a format that can also store comments, attribute weights and instance weights. The following file extensions are recognized as XRFF files:
• .xrff: The default extension of XRFF files
• .xrff.gz: The extension for gzip compressed XRFF files 
4.3 Converters
Weka contains converters for the following data sources:
• ARFF files (ArffLoader, ArffSaver)
• C4.5 files (C45Loader, C45Saver)
• CSV files (CSVLoader, CSVSaver)
• Files containing serialized instances (SerializedInstancesLoader, SerializedInstancesSaver)
• JDBC databases (DatabaseLoader, DatabaseSaver)
• libsvm files (LibSVMLoader, LibSVMSaver)
• XRFF files (XRFFLoader, XRFFSaver)
• Text directories for text mining (TextDirectoryLoader)
4.4 Databases
Weka comes with example files for the following databases:
• DatabaseUtils.props.hsql - HSQLDB (>= 3.4.1)
• DatabaseUtils.props.msaccess - MS Access (> 3.4.14, > 3.5.8, > 3.6.0)
• DatabaseUtils.props.mssqlserver- MS SQL Server 2000 (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.mssqlserver2005 - MS SQL Server 2005 (>= 3.4.11, >= 3.5.6)
• DatabaseUtils.props.mysql - MySQL (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.odbc - ODBC access via Sun’s ODBC/JDBC drivers
• DatabaseUtils.props.oracle - Oracle 10g (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.postgresql - PostgreSQL 7.4 (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.sqlite3 - sqlite 3.x (> 3.4.12, > 3.5.7)

5 Using the API

Weka provides an API that can be directly invoked from Java code. It can be used to embed machine learning algorithms in Java programs. The following are the steps required to embed a classifier:
Step 1: Express the problem with features
Step 2: Train a Classifier
Step 3: Test the classifier
Step 4: Use the classifier
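The four steps can be sketched without Weka on the classpath by using a homegrown majority-class learner in place of a real Weka classifier. This stand-in mimics the ZeroR baseline that Weka ships as weka.classifiers.rules.ZeroR; with the actual API you would load a weka.core.Instances dataset and call buildClassifier() and classifyInstance() instead.

```java
import java.util.HashMap;
import java.util.Map;

// The four embedding steps with a ZeroR-style (majority class) learner.
public class ZeroR {
    private String majority;

    // Step 2: Train - remember the most frequent class label.
    public void train(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        majority = counts.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }

    // Step 4: Use - ZeroR ignores the features and predicts the majority class.
    public String classify(double[] features) { return majority; }

    public static void main(String[] args) {
        // Step 1: Express the problem with features (dummy numeric features here).
        double[][] xs = {{1, 0}, {0, 1}, {1, 1}};
        String[] ys = {"yes", "yes", "no"};
        ZeroR model = new ZeroR();
        model.train(ys);
        // Step 3: Test the classifier (here, naively, on the training data).
        int correct = 0;
        for (int i = 0; i < xs.length; i++)
            if (model.classify(xs[i]).equals(ys[i])) correct++;
        System.out.println(model.classify(new double[]{0, 0}) + " " + correct + "/3");
        // prints "yes 2/3": the majority class is right on two of three examples
    }
}
```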

6 Weka Integration to Pentaho Data Integration

Weka can be easily integrated with the ETL tool Spoon using the Weka Scoring Plugin. The following steps are to be followed for the plugin installation:
1. The Weka scoring plugin can be downloaded from: http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-In 
2. Unpack the plugin archive and copy all files in the WekaScoringDeploy directory to a sub-directory in the plugins/steps directory of the Kettle installation. 
3. Copy the "weka.jar" file from the Weka installation folder to the same sub-directory in plugins/steps as before.

7 Pros and Cons of Weka

7.1 Advantages
• Open source
• Extensible
• Portable
• Relatively easy to use 
• Large collection of Data Mining algorithms
7.2 Disadvantages
• Sequence modelling is not covered by the algorithms included in the Weka distribution 
• Not capable of multi-relational data mining 
• Memory bound

8 Projects based on Weka

There are many projects that extend or wrap WEKA. Some of these include:
• Systems for natural language processing: GATE is an NLP tool that uses Weka for natural language processing.
• Knowledge discovery in biology: BioWEKA is an extension to WEKA for tasks in biology
• Distributed and parallel data mining: There are a number of applications that use Weka for distributed data mining. Some of them include Weka- Parallel, Grid Weka, FAEHIM and Weka4WS. 
• Open-source data mining systems: Many data mining systems provide plugins to access Weka’s algorithms. The R statistical computing environment provides an interface to Weka using RWeka package. 
• Scientific workflow environment: The Kepler open-source scientific workflow platform integrates Weka.

9 Alternatives to Weka

The following are the main alternatives to Weka:

  • R is a powerful statistical programming language. It is derived from the S language which was developed by Bell Labs.
  • ELKI is a similar project to Weka with a focus on cluster analysis 
  • KNIME is a machine learning and data mining software implemented in Java.
  • MOA is an open-source project for large scale mining of data streams, also developed at the University of Waikato in New Zealand.
  • Neural Designer is a data mining software based on deep learning techniques written in C++.
  • Orange is a similar open-source project for data mining, machine learning and visualization written in Python and C++.
  • RapidMiner is a commercial machine learning framework implemented in Java which integrates Weka.


One of the most challenging aspects of open source software is deciding what to include in it; contributions are therefore controlled, which limits community involvement. This can be managed using packages, and Weka's package management system is a good example. The mailing lists of open source software are also easier to maintain when the users are researchers, and Weka is developed and maintained by a team of researchers at Waikato University. One of the main advantages of Weka is that it has been incorporated into many open source projects. Hence, for a beginner in data mining, Weka is the best choice among the available open source projects.


Head Up, Stay Strong, Fake a Smile, Move on.....

Sunday, 13 December 2015

Weka

With the advent of search engines and social media sites, there is an explosion of data. Today’s age can be summed up as “we are drowning in data, but starving for knowledge”. Companies are spending millions to build data warehouses for storing the data, but most of them fail to get the expected ROI from it. Here comes the importance of data mining. Data mining is the process of gaining knowledge by analyzing the patterns and trends in the data. Different data mining tools such as R, RapidMiner and Weka are used for this purpose. Weka stands for Waikato Environment for Knowledge Analysis. It is a statistical and data analysis tool written in Java, developed by a team of researchers at Waikato University in New Zealand. Weka is a collection of visualization tools and algorithms for data analysis. It supports most of the standard data mining tasks such as data preprocessing, clustering, classification, regression, visualization and feature selection. Weka is open source software available under the GNU General Public License. It was originally written in C and later rewritten in Java, so it is compatible with all computing platforms. It also provides a GUI for ease of use. Weka works on the assumption that the data is available as a flat file in which the attributes of the data set are fixed. The most stable version of Weka is 3.6.13, released on September 11, 2015.

Features of Weka

The following are the important features of Weka:
  • Open source software: Weka is freely available under the GNU GPL. Its source code is written in Java.
  • Designed for data analysis: It consists of a vast collection of algorithms for data mining and machine learning, and is kept up to date with new algorithms being added.
  • Ease of use: It is easily usable by people who are not data mining specialists.
  • Platform independence: Weka runs on any platform with a Java runtime.

Functionalities provided by Weka

The following are the basic functionalities provided by Weka:
  • Data Preprocessing: Weka supports various data formats including the database connectivity using JDBC.
  • Classification: Weka includes more than 100 classification algorithms. Classifiers are divided into Bayesian methods (Naïve Bayes, Bayesian nets etc.), lazy methods (nearest neighbour and variants), rule-based methods (decision tables, OneR, RIPPER etc.), tree learners (C4.5, NBTree, M5), function-based learners (linear regression, SVM, Gaussian processes) and other miscellaneous methods.
  • Clustering: The clustering algorithms implemented in Weka include k-means, EM and hierarchical methods.
  • Attribute selection: The classifier performance depends on the attributes selected. Various search methods and selection criteria are available for attribute selection.
  • Data visualization: Various visualization options include Tree viewer, Dendrogram viewer and Bayes Network Viewer.
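The clustering functionality above can be illustrated with a toy one-dimensional k-means, the same idea behind Weka's SimpleKMeans (this is my own sketch, not Weka's implementation; the data, starting centroids and iteration count are arbitrary):

```java
import java.util.Arrays;

// Toy one-dimensional k-means with k = 2 clusters.
public class KMeans1D {
    public static double[] cluster(double[] data, double c0, double c1, int iterations) {
        double[] centroids = {c0, c1};
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[2];
            int[] n = new int[2];
            for (double x : data) {
                // Assign each point to the nearer centroid.
                int c = Math.abs(x - centroids[0]) <= Math.abs(x - centroids[1]) ? 0 : 1;
                sum[c] += x;
                n[c]++;
            }
            // Move each centroid to the mean of its assigned points.
            for (int c = 0; c < 2; c++) if (n[c] > 0) centroids[c] = sum[c] / n[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 10, 11, 12};          // two obvious groups
        System.out.println(Arrays.toString(cluster(data, 0, 5, 10)));
        // prints [2.0, 11.0]: the centroids settle on the two group means
    }
}
```

Unlike the classification example, no labels are provided: the grouping emerges from the data alone, which is what makes clustering unsupervised.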


More updates about Weka in the next post.

Nothing is softer than water, But its force can break the hardest rock.