Sunday 20 December 2015

More about Weka

Getting Started

1 Weka Installation

Weka requires Java. Java may already be installed; if not, the download page lists versions of Weka that bundle Java. The latest version of Weka can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/

2 User Interfaces

The GUI Chooser in Weka consists of four options:
• Explorer: An environment for exploring data with Weka.
• Experimenter: Used for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow: Supports the same functions as the Explorer, but with a drag-and-drop interface.
• Simple CLI: A simple command-line interface that allows direct execution of Weka commands, intended for operating systems that do not provide their own command line interface.

3 Package Installation

Choose Tools -> Package Manager from the Weka GUI Chooser. The package manager’s window is split horizontally into two parts: at the top is a list of packages and at the bottom is a mini browser that can be used to display information on the currently selected package.

4 Data Sets

Each entry in a dataset is an instance of the Java class weka.core.Instance. 
4.1 ARFF
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. It consists of:
• a header section prefixed by @RELATION
• an @ATTRIBUTE declaration for each attribute
• a data section prefixed by @DATA
• comma-separated data rows, with the class as the last attribute
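For illustration, a tiny hand-written ARFF file following this layout could look like the one below (a toy weather-style dataset made up for this post, not one of the files shipped with Weka):

@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature NUMERIC
@ATTRIBUTE humidity NUMERIC
@ATTRIBUTE play {yes, no}

@DATA
sunny,85,85,no
overcast,83,86,yes
rainy,70,96,yes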
4.2 XRFF
The XRFF (XML attribute Relation File Format) represents the data in a format that can also store comments, attribute weights and instance weights. The following file extensions are recognized as XRFF files:
• .xrff: The default extension of XRFF files
• .xrff.gz: The extension for gzip compressed XRFF files 
4.3 Converters
Weka contains converters for the following data sources:
• ARFF files (ArffLoader, ArffSaver)
• C4.5 files (C45Loader, C45Saver)
• CSV files (CSVLoader, CSVSaver)
• Files containing serialized instances (SerializedInstancesLoader, SerializedInstancesSaver)
• JDBC databases (DatabaseLoader, DatabaseSaver)
• libsvm files (LibSVMLoader, LibSVMSaver)
• XRFF files (XRFFLoader, XRFFSaver)
• Text directories for text mining (TextDirectoryLoader)
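As a small illustration of how these converter classes fit together, here is a rough sketch (assuming weka.jar is on the classpath and a local file called data.csv exists) that loads a CSV file and saves it again as ARFF:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file into an Instances object
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));   // hypothetical input file
        Instances data = loader.getDataSet();

        // Save the same data in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("data.arff"));     // hypothetical output file
        saver.writeBatch();
    }
}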
4.4 Databases
Weka comes with example files for the following databases:
• DatabaseUtils.props.hsql - HSQLDB (>= 3.4.1)
• DatabaseUtils.props.msaccess - MS Access (> 3.4.14, > 3.5.8, > 3.6.0)
• DatabaseUtils.props.mssqlserver - MS SQL Server 2000 (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.mssqlserver2005 - MS SQL Server 2005 (>= 3.4.11, >= 3.5.6)
• DatabaseUtils.props.mysql - MySQL (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.odbc - ODBC access via Sun’s ODBC/JDBC drivers
• DatabaseUtils.props.oracle - Oracle 10g (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.postgresql - PostgreSQL 7.4 (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.sqlite3 - sqlite 3.x (> 3.4.12, > 3.5.7)

5 Using the API

Weka provides an API that can be directly invoked from Java code. It can be used to embed machine learning algorithms in Java programs. The following are the steps required to embed a classifier:
Step 1: Express the problem with features
Step 2: Train a Classifier
Step 3: Test the classifier
Step 4: Use the classifier
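Here is a minimal sketch of these four steps using the Weka Java API (assuming weka.jar is on the classpath and a local ARFF file called iris.arff with the class as the last attribute; J48 is used purely as an example classifier):

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaApiDemo {
    public static void main(String[] args) throws Exception {
        // Step 1: load a dataset whose features express the problem
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        // Step 2: train a classifier (J48 decision tree as an example)
        Classifier cls = new J48();
        cls.buildClassifier(data);

        // Step 3: test the classifier with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cls, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());

        // Step 4: use the classifier to predict the class of the first instance
        double label = cls.classifyInstance(data.instance(0));
        System.out.println("Predicted class: " + data.classAttribute().value((int) label));
    }
}

Running this prints the cross-validation summary followed by the predicted class label of the first instance.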

6 Weka Integration with Pentaho Data Integration

Weka can be easily integrated with the Pentaho Data Integration tool (Spoon/Kettle) using the Weka Scoring Plugin. The plugin is installed as follows:
1. The Weka scoring plugin can be downloaded from: http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-In 
2. Unpack the plugin archive and copy all files in the WekaScoringDeploy directory to a sub-directory in the plugins/steps directory of the Kettle installation. 
3. Copy the "weka.jar" file from the Weka installation folder to the same sub-directory in plugins/steps as before.

7 Pros and Cons of Weka

7.1 Advantages
• Open source
• Extensible
• Portable
• Relatively easy to use
• Large collection of Data Mining algorithms
7.2 Disadvantages
• Sequence modelling is not covered by the algorithms included in the Weka distribution 
• Not capable of multi-relational data mining 
• Memory bound: datasets generally have to fit in main memory

8 Projects based on Weka

There are many projects that extend or wrap WEKA. Some of these include:
• Systems for natural language processing: GATE is an NLP tool that uses Weka for natural language processing.
• Knowledge discovery in biology: BioWEKA is an extension to WEKA for tasks in biology.
• Distributed and parallel data mining: A number of applications use Weka for distributed data mining, including Weka-Parallel, Grid Weka, FAEHIM and Weka4WS. 
• Open-source data mining systems: Many data mining systems provide plugins to access Weka’s algorithms; for example, the R statistical computing environment provides an interface to Weka through the RWeka package. 
• Scientific workflow environment: The Kepler open-source scientific workflow platform integrates Weka.

9 Alternatives to Weka

The following are the main alternatives to Weka:

  • R is a powerful statistical programming language. It is derived from the S language, which was developed at Bell Labs.
  • ELKI is a project similar to Weka with a focus on cluster analysis.
  • KNIME is a machine learning and data mining platform implemented in Java.
  • MOA is an open-source project for large-scale mining of data streams, also developed at the University of Waikato in New Zealand.
  • Neural Designer is a data mining tool based on deep learning techniques, written in C++.
  • Orange is a similar open-source project for data mining, machine learning and visualization, written in Python and C++.
  • RapidMiner is a commercial machine learning framework implemented in Java which integrates Weka.


One of the most challenging aspects of open source software is deciding what to include in it, so contributions are usually controlled, which limits community involvement. This can be managed through packages, and Weka’s package management system is a good example of this approach. The mailing lists of open source software are also easier to maintain when the users are researchers. Weka is developed and maintained by a team of researchers at Waikato University. One of the main advantages of Weka is that it has been incorporated into many open source projects. Hence, for a beginner in data mining, Weka is one of the best choices among the available open source projects.


Head Up, Stay Strong, Fake a Smile, Move on.....

Sunday 13 December 2015

Weka

With the advent of search engines and social media sites, there is an explosion of data. Today’s age can be summed up as “we are drowning in data, but starving for knowledge”. Companies are spending millions to build data warehouses for storing this data, but most of them fail to get the expected ROI from it. Here comes the importance of data mining. Data mining is the process of gaining knowledge by analyzing the patterns and trends in the data. Different data mining tools such as R, RapidMiner and Weka are used for this purpose. Weka stands for Waikato Environment for Knowledge Analysis. It is a statistical and data analysis tool written in Java, developed by a team of researchers at Waikato University in New Zealand. Weka is a collection of visualization tools and algorithms for data analysis. It supports most of the standard data mining tasks such as data preprocessing, clustering, classification, regression, visualization and feature selection. Weka is open source data mining software, available under the GNU General Public License. It was originally written in C and later rewritten in Java, so it runs on all common computing platforms. It also provides a GUI for ease of use. Weka works on the assumption that the data is available as a flat file in which the attributes of the data set are fixed. The most stable version of Weka is 3.6.13, which was released on September 11, 2015. 

Features of Weka

The following are the important features of Weka:
2.1.1 Open source software
Weka is freely available under the GNU GPL. The source code of Weka is written in Java.
2.1.2 Designed for data analysis
It consists of a vast collection of algorithms for data mining and machine learning. Weka is kept up to date, with new algorithms being added regularly.
2.1.3 Ease of use
It is easily usable by people who are not data mining specialists.
2.1.4 Platform independence
Being written in Java, Weka is platform independent.

Functionalities provided by Weka

The following are the basic functionalities provided by Weka:
  • Data Preprocessing: Weka supports various data formats, including database connectivity via JDBC.
  • Classification: Weka includes more than 100 classification algorithms. Classifiers are divided into Bayesian methods (Naïve Bayes, Bayesian nets, etc.), lazy methods (nearest neighbour and variants), rule-based methods (decision tables, OneR, RIPPER, etc.), tree learners (C4.5, NBTree, M5), function-based learners (Linear Regression, SVM, Gaussian Processes) and other miscellaneous methods.
  • Clustering: The clustering algorithms implemented in Weka include k-means, EM and several hierarchical methods (a short usage sketch follows this list).
  • Attribute selection: Classifier performance depends on the attributes selected. Various search methods and selection criteria are available for attribute selection.
  • Data visualization: Visualization options include a tree viewer, a dendrogram viewer and a Bayes network viewer.
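As promised above, here is a rough sketch of the preprocessing and clustering functionality (again assuming weka.jar on the classpath and a local iris.arff file; the class attribute is removed with a filter before clustering, and k-means with 3 clusters is only an example choice):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class PreprocessAndCluster {
    public static void main(String[] args) throws Exception {
        // Preprocessing: load the data and remove the last (class) attribute
        Instances data = new DataSource("iris.arff").getDataSet();
        Remove remove = new Remove();
        remove.setAttributeIndices("last");     // drop the class attribute before clustering
        remove.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, remove);

        // Clustering: run k-means with 3 clusters
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        kmeans.buildClusterer(unlabeled);
        System.out.println(kmeans);             // prints cluster centroids and sizes
    }
}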


More updates about Weka in the next post.

Nothing is softer than water, But its force can break the hardest rock.

Sunday 11 October 2015

Data Visualization

Information in an unorganized form is called data. Data can come from diverse sources such as social media, sensors, transaction logs, etc. Tables or text files are used to store the data, but it is hard to understand the data in these raw formats. The human brain grasps visuals far more readily than plain facts and figures, and data visualization tools exploit this to make data understandable. A wide variety of visualization tools are available in the market, including Actuate, QlikView, Spotfire, Google Chart API, Flot and Raphael. Besides the common functionality, each tool provides its own features, so the choice of a tool depends on the context in which it is used. Data consists of raw facts and figures. Data visualization is the process of representing the data in the form of charts, maps or any other graphical means that makes the content easier to understand. The first data visualization was created by René Descartes using the Cartesian co-ordinate system in the 17th century. With the advent of social media sites like Facebook and Twitter, the amount of data collected, stored and analyzed has increased significantly, and data analysis has therefore gained importance. Friend maps and Twitter Vision are data visualizations familiar to such users. Graphical representations are more helpful than Excel files or tables containing the data, since they help one think about the data by revealing the underlying patterns and connections between different elements. Data visualization tools enable users to quickly create complex visualizations using data from diverse data sources. Some of the leading data visualization tools include Tableau, QlikView and Actuate.

Functions of Data Visualization Tools

The following are the main functions of data visualization tools:

  • Minimization of effort: The data can be analyzed quickly by connecting to different sources using drag-and-drop functionality. This reduces the lines of code written by developers.
  • Framing questions: Data visualizations help in identifying outliers in the data. This leads to identifying the problems in the data.
  • Answering questions: The findings in the visualization can be used to identify trends, which in turn can be used to predict future observations.

Requirements for data visualization

The requirements of data visualization can be classified into the following:

  • Functional Requirements


Functional requirements are the set of activities that the system should perform. They include the following:
  1. Support real time creation of dynamic and interactive charts
  2. Allow interaction of multiple users with the data across diverse platforms
  3. Ability to visualize data from different data sources
  4. Provide secure access of data by end users


  • Non Functional Requirements


Non-functional requirements describe how well the system performs rather than what it does. They are mainly used to judge the quality of the system and include performance, scalability, data integrity, etc.

Stages of Data Visualization

Benjamin Fry is an American data visualization expert who has proposed seven stages of data visualization. Each stage can be briefly explained as follows:

  1. Acquire: The data must be retrieved from a data source.
  2. Parse: The data obtained will not necessarily be in a format suitable for visualization, so it must be structured into categories.
  3. Filter: Unimportant data must be removed to prevent information overload.
  4. Mine: Different statistical methods can be applied to identify trends and patterns in the data.
  5. Represent: Different views and representations of the data lead to better decision making.
  6. Refine: The basic visual model chosen is further refined to make the representation clearer and more visually intuitive.
  7. Interact: Different methods of interaction are added to allow users to decide what they see and how they see it.

Data Visualization Tools

Some of the leading data visualization tools are the following:

Actuate

The Actuate Data Visualization Suite consists of BIRT Analytics, BIRT Designer and the BIRT iHub Runtime and Viewer.

  • BIRT Analytics: A visual data mining and predictive analytics tool. Its main features include:
          - Social: It can connect to both social and web data sources, including Facebook, Twitter and Google Analytics.
          - Predictive: It incorporates both predictive analysis and visual data mining in a single product.
          - Quick big data: It can analyze huge volumes of data within a short span of time.
  • BIRT Designer: An open source reporting tool based on the Eclipse IDE. BIRT Designer is used by developers to create visualizations based on data from different data sources. It has the following characteristics:
          - Data integration from diverse data sources
          - Tools to secure, filter, format and present dynamic reports to end users
          - A set of component libraries
  • BIRT iHub Runtime and Viewer: The deployment platform for all BIRT content. It includes the following functionality:
          - Data drivers for data sources such as Hadoop and Oracle
          - Publishing of BIRT content to web, mobile and print media
          - Access control for BIRT content

QlikView

QlikView is software that helps users retrieve and analyze data easily from any source. It offers a wide variety of charts, tables, etc. for representing the data. The QlikView stack of products includes QlikView Personal Edition, QlikView Server and QlikView Publisher.

  • QlikView Personal Edition: It provides full QlikView functionality, but documents created by other users cannot be opened without a QlikView license. QlikView Personal Edition can be downloaded as a standalone application.
  • QlikView Server: QlikView information can be shared and hosted using the QlikView Server platform.
  • QlikView Publisher: It manages the content and access. QlikView Publisher distributes data stored in QlikView documents to end users.

Tableau

Tableau lets users visualize data by dragging and dropping it. The product family consists of Tableau Desktop, Tableau Server, Tableau Online, Tableau Public and Tableau Reader.

  • Tableau Desktop is a standalone desktop application.
  • Tableau Server is a browser-based business intelligence solution.
  • Tableau Online is a hosted version of Tableau Server.
  • Tableau Public is a service used for interactive data visualization.
  • Tableau Reader is a free desktop application used to view visualizations built in Tableau Desktop.

With the advent of social media sites and search engines, a large amount of data is produced daily, and the need for data analysis keeps increasing. So it is high time to analyze raw data and present the information to the end user in an intuitive way. Of the wide variety of tools available, the non-functional requirements were evaluated for Actuate, QlikView and Tableau. The most important feature of Actuate is its Live Excel functionality, which allows data to be exported as pivot tables. QlikView has an intuitive user interface, but its implementation time is high compared to Tableau. Clearly, each tool has its own USP, and many of their NFR attributes complement each other.


Choose a job you love, and you will never have to work a day in your life.


Monday 31 August 2015

More about Sentiment Analysis


Method of Study

A corpus is a collection of documents for analysis. The current evaluation was done on a mail corpus: a set of e-mails expressing different sentiments, most of which express negative opinions. The analysis was carried out in the following steps:

Document Splitting

A mail corpus may contain a single mail or a chain of e-mails. So the initial step is to identify the structure of the mail corpus; a chain of e-mails is split into individual mails for analysis.

Sentence splitting

A document is a collection of sentences, so in order to find the opinion of the speaker, all the sentences in a document are analyzed. Hence the given mail is split into sentences and each sentence is fed to the sentiment analysis API for analysis. A simple way of doing this splitting is sketched below.
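As one simple, tool-agnostic way of splitting text into sentences in Java, the sketch below uses the JDK's java.text.BreakIterator; the actual evaluation could just as well rely on the sentence splitter built into the respective API:

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    // Split a mail body into sentences using the JDK's locale-aware sentence iterator
    public static List<String> split(String mailBody) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(mailBody);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(mailBody.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("The delivery was late. I am very disappointed with the service.")) {
            System.out.println(s);
        }
    }
}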

Data transfer to API

Appropriate JAR files or sentiment analysis lexicons were downloaded and installed for evaluating the open source tools. The commercial APIs, in contrast, are all available as REST services, and free API keys had to be registered for their evaluation. So in the case of the commercial APIs, appropriately formatted text is fed as input to the web-based service.

JSON parsing

All APIs return the result of sentiment analysis in either XML or JSON format, so appropriate parsers are required to extract the type of sentiment and its score from these responses. The Jackson JSON parsing API was used to parse the responses.

Extracting the score and type of sentiment

The relevant fields are extracted from the JSON response of the API, and the sentiment type and score are displayed as output.
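A minimal sketch of this parsing step with Jackson is shown below. The field names "type" and "score" are only an assumed response shape for illustration, since every API uses its own JSON structure:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SentimentResponseParser {
    public static void main(String[] args) throws Exception {
        // Hypothetical JSON response; real APIs use their own field names
        String json = "{\"type\": \"negative\", \"score\": -0.62}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);

        // Extract the sentiment type and score from the parsed tree
        String type = root.get("type").asText();
        double score = root.get("score").asDouble();
        System.out.println("Sentiment: " + type + ", score: " + score);
    }
}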

Platform used for study

The analysis was done based on the Java based APIs of the different tools.

Key Findings

Longer texts are hard to classify

Lexicon-based classification does not work well on lengthy texts, because such texts may express strong opinions even though few explicitly subjective words appear in them.

Results depend on the training set used in the API

Most sentiment analysis tools are built on specific training corpora. For example, the SentiWordNet sentiment lexicon was formulated based on a movie review dataset, and hence it performs best on movie reviews.

Future Works

The above approach of evaluation can be further refined by incorporating the following features:

Real time Sentiment Analysis

The current approach analyses all the mails stored in a folder on the machine. This can be modified so that sentiment analysis is done in real time: as soon as a mail enters the inbox, its sentiment is analysed on the fly.

Culture based sentiment analysis

This approach extracts semantically hidden concepts from mails and incorporates them into supervised classifier training by interpolation. The interpolation method works by interpolating the original unigram language model in the Naive Bayes classifier with the generative model of words given a semantic concept. Cultural features can be incorporated in a similar way, where the unigram language model is also interpolated with the generative model of users given cultural features.

This comparison study focused on detecting the polarity of content, i.e. positive and negative effects, and did not consider other types of sentiment such as anger or calmness. Only a few of the methods are able to reach a reasonably high level of accuracy. Each of the evaluated tools is trained on a specific corpus, so the results of the analysis depend on the training set used to build the model in the tool. Thus, sentiment analysis tools still have a long way to go before reaching the confidence level demanded by practical applications.





Action without knowledge is wastage and Knowledge without action is futile!!

Wednesday 5 August 2015

Tools used for Sentiment Analysis

1.1      AlchemyAPI

AlchemyAPI combines linguistic and statistical analysis and was formulated based on tweets. The linguistic analysis consists of identifying phrases and how these phrases combine to form sentences; the statistical analysis consists of using mathematical techniques for text analysis. AlchemyAPI has more than 30,000 users. Its sentiment analysis APIs are capable of computing document-level sentiment, user-specified sentiment targeting, entity-level sentiment, emoticon sentiment and keyword-level sentiment. AlchemyAPI can be used easily with any major programming language: Java, C/C++, C#, Perl, PHP, Python, Ruby, JavaScript and Android OS. AlchemyAPI uses a REST interface to access the different text algorithms. It can process content as plain text or HTML, and URLs can be used for web-accessible content or raw HTML for private web documents. Most of the functions work with 8 languages: English, German, French, Italian, Portuguese, Russian, Spanish and Swedish. AlchemyAPI is a paid service, but it also offers a free API key with 1000 calls per day to get started.

1.2      SentiWordNet

The automatic annotation of all synsets in WordNet has given rise to SentiWordNet. Four versions are available: SentiWordNet 1.0, 1.1, 2.0 and 3.0. SentiWordNet 1.0 was based on the concept of bag of words; SentiWordNet 3.0 is the most widely used. It is freely distributed for noncommercial use, and licenses are available for commercial applications. In SentiWordNet the degree of positivity or negativity ranges from 0 to 1. SentiWordNet was developed by ranking the synsets according to their part of speech (PoS). The parts of speech represented in SentiWordNet are adjective, noun, adverb and verb, denoted 'a', 'n', 'r' and 'v' respectively. The database has five columns: the part of speech, the offset, the positive score, the negative score and the synset terms, which include all terms belonging to a particular synset. The offset is a numerical ID that, combined with a particular part of speech, identifies a synset. The SentiWordNet lexical database was formulated based on the movie review dataset.

Field         Description
POS           Part of speech linked with the synset. It can take four possible values: a (adjective), v (verb), n (noun), r (adverb).
Offset        Numerical ID which, together with the part of speech, uniquely identifies a synset in the database.
PosScore      Positive score for this synset; a numerical value ranging from 0 to 1.
NegScore      Negative score for this synset; a numerical value ranging from 0 to 1.
Synset Terms  List of all terms included in this synset.
Table 2: SentiWordNet database structure

POS   Offset     PosScore   NegScore   SynsetTerms
a     1740       0.125      0          able#1
a     2098       0          0.75       unable#1
n     388959     0          0          divarication#1
n     389043     0          0          fibrillation#2
r     76948      0.625      0          brazenly#1
r     77042      0.125      0.5        brilliantly#2
v     1827745    0          0          slobber_over#1
v     1827858    0.625      0.125      look_up_to#1
Table 3: Sentiment scores associated with SentiWordNet entries
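To illustrate how these fields can be read programmatically, the sketch below parses a single tab-separated SentiWordNet entry (the exact file layout, e.g. the trailing gloss column, can vary between SentiWordNet versions, so treat the line format as an assumption):

public class SentiWordNetLineParser {
    public static void main(String[] args) {
        // One tab-separated entry: POS, offset, PosScore, NegScore, SynsetTerms (gloss omitted here)
        String line = "a\t00002098\t0\t0.75\tunable#1";
        String[] fields = line.split("\t");

        String pos = fields[0];
        String offset = fields[1];
        double posScore = Double.parseDouble(fields[2]);
        double negScore = Double.parseDouble(fields[3]);
        String synsetTerms = fields[4];

        // Objectivity is whatever remains after the positive and negative scores
        double objScore = 1.0 - (posScore + negScore);
        System.out.println(pos + " " + synsetTerms + ": pos=" + posScore
                + " neg=" + negScore + " obj=" + objScore);
    }
}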

1.3      Stanford NLP

Stanford NLP is the Java suite of NLP tools developed by Stanford University. It consists of a stack of products including Stanford CoreNLP, the Stanford Parser, the Stanford POS Tagger, the Stanford Named Entity Recognizer, the Stanford Word Segmenter, etc. The movie review dataset was used for training the sentiment model in Stanford NLP. In Stanford NLP the raw text is put into an Annotation object and then a sequence of Annotators add information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be output in XML or plain text form. The results of Stanford NLP can be accessed in two ways: the first involves converting the Annotation object to XML and writing it to a file; the second involves writing code that gets a particular type of information out of an Annotation. Stanford NLP can be accessed easily from many languages, including Python, Ruby, Perl, Scala, Clojure, JavaScript (node.js) and .NET.
The execution flow of Stanford NLP consists of the following phases (a small pipeline sketch follows the list):
  • Tokenization: The process of chopping a sequence of characters into pieces called tokens.
  • Sentence Splitting: The ssplit property splits a sequence of tokens into sentences.
  • Part-of-speech Tagging: The pos property labels tokens with their POS tags.
  • Morphological Analysis: The process of providing grammatical information about a word given its suffix. The smallest unit in morphological analysis is the morpheme.
  • Named Entity Recognition: The ner property recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, PERCENT) and temporal (DATE, TIME, DURATION, SET) entities in a given text.
  • Syntactic Parsing: Deals with the grammatical structure of sentences; it consists of identifying phrases and the subject or object of a verb.
  • Coreference Resolution: Coreference means that multiple expressions in a sentence or document refer to the same thing. For example, in "John drove to Judy’s house. He made her dinner.", both "John" and "He" refer to the same entity (John), and "Judy" and "her" refer to the entity (Judy).
  • Annotators: The backbone of the CoreNLP package is formed by two classes: Annotation and Annotator. Annotations are the data structures which hold the results of Annotators; they are basically maps from keys to bits of the annotation, such as the parse, the part-of-speech tags, or named entity tags. Annotators tokenize, parse, or NER-tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. Stanford CoreNLP inherits from the AnnotationPipeline class and is customized with NLP Annotators.
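A small sketch of such a pipeline for sentence-level sentiment is given below (annotator and class names as in recent Stanford CoreNLP releases; the CoreNLP model jars must be on the classpath):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class CoreNlpSentimentDemo {
    public static void main(String[] args) {
        // Build the annotation pipeline: tokenize -> ssplit -> pos -> parse -> sentiment
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate the raw text and read the sentiment class of each sentence
        Annotation doc = new Annotation("The delivery was late. I am very disappointed.");
        pipeline.annotate(doc);
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(sentiment + "\t" + sentence);
        }
    }
}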

1.4      viralheat API

viralheat API is used to infer the sentiment of a given piece of text. The free account of viralheat API can handle 1000 requests per day and accepts only 360 characters per request.


Just wait for more updates in the next post…


To succeed in your mission, you must have single-minded devotion to your goal.

Saturday 18 July 2015

Sentiment Analysis

Nowadays, people tend to spend more time on social media platforms such as Facebook and Twitter. The interactions on social media leave a trail of a huge amount of data, and a major portion of it is textual. The art of opinion mining is termed sentiment analysis. It involves classifying text as positive, negative or neutral based on its polarity, and also determining the attitude of the speaker with respect to the topic. Knowing whether the trending tweets about a product are positive or negative helps a company identify areas of improvement. As the usage of social media sites increases day by day, a huge amount of textual data is generated. This data can be used to analyze the opinions of customers on social media: the reputation of a product can be analyzed, and offers can be generated based on customer preferences. Here comes the importance of sentiment analysis, the process of extracting the opinion of the speaker from plain text, also termed polarity detection. Based on the opinion of the speaker, the text can be classified as positive, negative or neutral. Sentiment analysis is done using different tools, which can be either open source or commercial; AlchemyAPI, SentiWordNet, Stanford NLP, the viralheat API, Sentimatrix and Python NLTK are some of them.

Types of Sentiment Analysis

Based on the algorithms used, sentiment analysis can be classified into different categories. The classification can be based either on the polarity detection method or on the structure of the text analyzed.
1. Classification based on the polarity detection method
  • Supervised- It is a machine learning technique in which a classifier is trained based on a feature set.
  • Unsupervised- In the unsupervised method a sentiment lexicon is used to detect the polarity of the given text.
  • Hybrid- A combination of supervised and unsupervised methods form the hybrid method.

2. Classification based on the structure of the text
  • Document level- It aims to find the sentiment for the whole document
  • Sentence level- Here a document is split into sentences and the opinion is analysed for each sentence.
  • Word level- The opinion mining is done for each word in a sentence.

Steps in Sentiment Analysis

The process of sentiment analysis involves four steps namely:
1.    Pre-processing and breaking the text into parts of speech
This involves the following steps:
POS tagging: It is the process of assigning parts of speech such as noun, verb and adjective to each word in a text. It is done based on treebanks; a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The Penn Treebank is normally used for this purpose. E.g. "This is a sample sentence" is output as "This/DT is/VBZ a/DT sample/NN sentence/NN", where DT is a determiner, VBZ is a verb in 3rd person singular present, and NN is a singular noun.

Number  Tag   Description
1       CC    Coordinating conjunction
2       CD    Cardinal number
3       DT    Determiner
4       EX    Existential there
5       FW    Foreign word
6       IN    Preposition or subordinating conjunction
7       JJ    Adjective
8       JJR   Adjective, comparative
9       JJS   Adjective, superlative
10      LS    List item marker
11      MD    Modal
12      NN    Noun, singular or mass
13      NNS   Noun, plural
14      NNP   Proper noun, singular
15      NNPS  Proper noun, plural
16      PDT   Predeterminer
17      POS   Possessive ending
18      PRP   Personal pronoun
19      PRP$  Possessive pronoun
20      RB    Adverb
21      RBR   Adverb, comparative
22      RBS   Adverb, superlative
23      RP    Particle
24      SYM   Symbol
25      TO    to
26      UH    Interjection
27      VB    Verb, base form
28      VBD   Verb, past tense
29      VBG   Verb, gerund or present participle
30      VBN   Verb, past participle
31      VBP   Verb, non-3rd person singular present
32      VBZ   Verb, 3rd person singular present
33      WDT   Wh-determiner
34      WP    Wh-pronoun
35      WP$   Possessive wh-pronoun
36      WRB   Wh-adverb
Table 1: Alphabetical list of part-of-speech tags used in the Penn Treebank Project
Chunking: The process of identifying phrases is called chunking. E.g. "All the humans are cardboard cliches in this film" can be divided into two noun chunks: "All the humans" and "cardboard cliches in this film".
Named Entity Recognition: It involves finding named entities such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. E.g. "Jim bought 300 shares of Acme Corp. in 2006" can be parsed as "[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time".
2.    Subjective/ Objective classification
The different sentences in a document are classified into subjective and objective sentences. Fact based sentences are called objective sentences. E.g. “47% of Americans pay no federal income tax”. Subjective sentences consist of personal opinions, interpretations, points of view etc. E.g. “Spanish is difficult”.
3.    Polarity classification
The subjective sentences are classified as positive, negative or neutral based on the opinion of the speaker in the sentence. Calculating the polarity of each sentence is very important for determining the overall sentiment. Features such as the following are typically fed to a classifier such as an SVM for this purpose:
Bag of Words: Here a text is represented as a bag or multiset of words by disregarding the grammar while preserving the multiplicity. E.g. consider the following two text documents:
John likes to watch movies. Mary likes movies too.
John also likes to watch football games.
Based on these two text documents, a dictionary is constructed as:
{
"John": 1,
"likes": 2,
"to": 3,
"watch": 4,
"movies": 5,
"also": 6,
"football": 7,
"games": 8,
"Mary": 9,
"too": 10
}
This dictionary has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Here each entry of a vector is the count of the corresponding entry in the dictionary. The first vector represents document 1, while the second vector represents document 2. In the first vector the first two entries are "1, 2": the first entry corresponds to the word "John", the first word in the dictionary, and its value is "1" because "John" appears once in the first document; similarly, the second entry corresponds to the word "likes", the second word in the dictionary, and its value is "2" because "likes" appears twice in the first document. A small code sketch of this counting is given below.
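As mentioned above, here is a small sketch of how these counts could be computed in Java, with the dictionary hard-coded in the same order as listed above and a deliberately simplistic tokenizer:

import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    public static void main(String[] args) {
        // Dictionary in the same order as in the listing above
        String[] vocabulary = {"John", "likes", "to", "watch", "movies",
                               "also", "football", "games", "Mary", "too"};
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.length; i++) {
            index.put(vocabulary[i], i);
        }

        String[] docs = {
            "John likes to watch movies. Mary likes movies too.",
            "John also likes to watch football games."
        };

        // Count how often each dictionary word occurs in each document
        for (String doc : docs) {
            int[] vector = new int[vocabulary.length];
            for (String token : doc.replaceAll("[^A-Za-z ]", "").split("\\s+")) {
                Integer i = index.get(token);
                if (i != null) {
                    vector[i]++;
                }
            }
            System.out.println(java.util.Arrays.toString(vector));
        }
    }
}

Running this prints the two 10-entry vectors shown above, one per document.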
Negation Handling: Negation plays an important role in polarity analysis. E.g. "This is not a good movie" has the opposite polarity to the sentence "This is a good movie", although the features of the original model would suggest that they have the same polarity. So in order to handle the word "good" in the first and second sentences differently, polarity information is attached to the word, and the first sentence is interpreted as expressing a negative opinion.
4.    Sentiment Aggregation
The main task of this step is to aggregate the overall sentiment of the document from the sentences which were tagged positive and negative in polarity classification.

Applications of Sentiment Analysis

  • Product recommendations: Opinion mining can be used to provide recommendations to customers based on the Word of mouth.
  • Reputation analysis in social media: The public opinion regarding a product or service can be analyzed by text mining.

Challenges in Sentiment Analysis


o    Named Entity Recognition: In some sentences it is difficult to find the topic the author is speaking about. E.g. is "300 Spartans" a group of Greeks or a movie?
o    Anaphora Resolution: The problem of resolving what a pronoun or a noun phrase refers to. E.g. "We watched the movie and went to dinner; it was awful." What does "it" refer to?
o    Parsing: This deals with finding the subject and object of the sentence, and which one the verb and/or adjective actually refers to.
o    Sarcasm: If we don't know the author, we have no idea whether 'bad' means bad or good.
o    Texts from social media sites: Ungrammatical sentences, abbreviations, lack of capitals, poor spelling, poor punctuation and poor grammar occur commonly in social media posts.
o    Detecting in-depth sentiment/emotion: Positive versus negative is a very simple analysis; one challenge is how to extract emotions such as how much hate, happiness or sadness lies inside the opinion.
o    Finding the object for which the opinion is expressed: For example, "She beat him!" expresses a positive sentiment for her and a negative sentiment for him at the same time.
o    Analysis of very subjective sentences or paragraphs: Sometimes even humans find it hard to agree on the sentiment of such highly subjective texts; it is even harder for a computer.



Difficult Roads Lead To Beautiful Destinations!!!!!