Monday 31 August 2015

More about Sentiment Analysis


Method of Study

A corpus is a collection of documents for analysis. The current evaluation was done on a mail corpus: a set of e-mails expressing different sentiments, most of which express negative opinions. The analysis was carried out in the following steps:

Document Splitting

A mail corpus item may contain a single mail or a chain of e-mails, so the initial step is to identify which of the two it is. A chain of e-mails is split into individual mails for analysis, for example along the lines of the sketch below.
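A minimal sketch of one way to do this split. It assumes chained mails are separated by common reply markers; the separator patterns are an assumption and would need tuning to the mail clients that produced the corpus.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class MailChainSplitter {

    // Common reply separators; these patterns are an assumption and should be
    // tuned to the mail clients that produced the corpus.
    private static final Pattern REPLY_SEPARATOR = Pattern.compile(
            "(?m)^(-{2,}\\s*Original Message\\s*-{2,}|On .{0,80} wrote:)\\s*$");

    /** Splits a mail chain into individual mails; a single mail yields one element. */
    public static List<String> split(String chain) {
        List<String> mails = new ArrayList<>();
        for (String part : REPLY_SEPARATOR.split(chain)) {
            if (!part.trim().isEmpty()) {
                mails.add(part.trim());
            }
        }
        return mails;
    }
}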

Sentence Splitting

A document is a collection of sentences, so to find the opinion of the writer, every sentence in the document is analyzed. Each mail is therefore split into sentences, and each sentence is fed to the sentiment analysis API for analysis, for instance as sketched below.
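A minimal sentence-splitting sketch using the JDK's locale-aware BreakIterator; an NLP splitter such as the CoreNLP ssplit annotator mentioned later could equally be used.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    /** Splits a mail body into sentences using the JDK's locale-aware BreakIterator. */
    public static List<String> toSentences(String mailBody) {
        List<String> sentences = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        iterator.setText(mailBody);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            sentences.add(mailBody.substring(start, end).trim());
        }
        return sentences;
    }
}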

Data transfer to API

Appropriate JAR files or sentiment analysis lexicons are downloaded and installed for evaluating the open source tools. The commercial APIs, in contrast, are all available as REST services, and a free API key must be registered for their evaluation. In the case of the commercial APIs, appropriately formatted text is therefore fed as input to the web-based service, roughly as in the sketch below.
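A generic sketch of such a call using the JDK's HttpURLConnection. The endpoint URL and the apikey/text parameter names are placeholders, since each commercial API documents its own URL, parameters and authentication scheme.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SentimentApiClient {

    /**
     * Posts one sentence to a REST sentiment service and returns the raw JSON
     * response. The endpoint and parameter names are placeholders.
     */
    public static String analyze(String sentence, String apiKey) throws Exception {
        URL url = new URL("https://api.example.com/v1/sentiment");   // placeholder endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        String body = "apikey=" + URLEncoder.encode(apiKey, "UTF-8")
                + "&text=" + URLEncoder.encode(sentence, "UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
        }
        return response.toString();
    }
}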

JSON parsing

All APIs return the result of sentiment analysis in either XML or JSON format, so appropriate parsers are required to extract the type of sentiment and the score from these responses. The Jackson JSON parsing API was used to parse the JSON responses.

Extracting the score and type of sentiment

The sentiment type and score are extracted from the JSON response and displayed as output.
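A minimal Jackson-based sketch of this step; the field names "type" and "score" are an assumed response shape, since each API documents its own schema.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SentimentResultParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Extracts the sentiment type and score from a JSON response such as
     * {"type":"negative","score":-0.62}. The field names are an assumption.
     */
    public static void printSentiment(String json) throws Exception {
        JsonNode root = MAPPER.readTree(json);
        String type = root.path("type").asText("unknown");
        double score = root.path("score").asDouble(0.0);
        System.out.println("Sentiment: " + type + " (score " + score + ")");
    }
}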

Platform used for study

The analysis was done using the Java-based APIs of the different tools.

Key Findings

Longer texts are hard to classify

Lexicon-based classification does not work well on lengthy texts. This is because such texts may express strong opinions without containing any explicitly subjective words.

Results depend on the training set used in the API

Most sentiment analysis tools are built on specific training corpora. For example, the SentiWordNet sentiment lexicon was formulated based on a movie review dataset; hence it performs best on movie reviews.

Future Work

The above evaluation approach can be further refined by incorporating the following features:

Real-time Sentiment Analysis

The current approach consists of analysing all the mails stored in a folder on the machine. This can be modified so that sentiment analysis is done in real time: as soon as a mail enters the inbox, its sentiment is analysed on the fly, for example by watching the inbox directory as sketched below.
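A sketch of this idea using the JDK's WatchService. It assumes incoming mails are dropped as files into a local inbox directory; a real deployment would more likely hook into the mail server, e.g. via IMAP IDLE.

import java.nio.file.*;

public class InboxWatcher {

    /**
     * Watches a local inbox directory and triggers analysis whenever a new
     * mail file appears.
     */
    public static void watch(Path inboxDir) throws Exception {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        inboxDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();                     // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path newMail = inboxDir.resolve((Path) event.context());
                System.out.println("Analysing new mail: " + newMail);
                // feed the mail through the splitting/analysis pipeline described above
            }
            key.reset();
        }
    }
}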

Culture-based sentiment analysis

This approach extracts semantically hidden concepts from mails and incorporates them into supervised classifier training by interpolation: the original unigram language model in the Naive Bayes classifier is interpolated with a generative model of words given a semantic concept. Cultural features can be incorporated in the same way, with the unigram language model also interpolated with a generative model of users given cultural features.
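In other words, the interpolated likelihood is P(w|class) = λ · P_unigram(w|class) + (1 − λ) · P(w|concept). A minimal sketch, where the mixing weight λ and the two component probabilities are assumed to come from models trained elsewhere:

public class InterpolatedModel {

    /**
     * Linear interpolation of the Naive Bayes unigram likelihood with a
     * generative model of words given a semantic (or cultural) concept:
     * P(w|class) = lambda * P_unigram(w|class) + (1 - lambda) * P(w|concept).
     * The mixing weight lambda would normally be tuned on held-out data.
     */
    public static double interpolate(double pUnigram, double pGivenConcept, double lambda) {
        return lambda * pUnigram + (1 - lambda) * pGivenConcept;
    }
}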

This comparison study focused on detecting the polarity of content, i.e. positive versus negative affect, and did not consider other types of sentiment such as anger or calmness. Only a few of the methods reach a reasonably high level of accuracy. Each of the evaluated tools is trained on a specific corpus, so the results of the analysis depend on the training set used to build the model in each tool. The sentiment analysis tools thus still have a long way to go before reaching the confidence level demanded by practical applications.





Action without knowledge is wastage and Knowledge without action is futile!!

Wednesday 5 August 2015

Tools used for Sentiment Analysis

1.1      AlchemyAPI

AlchemyAPI combines linguistic and statistical analysis and was formulated based on tweets. The linguistic analysis identifies phrases and how these phrases combine to form sentences; the statistical analysis applies mathematical techniques to the text. AlchemyAPI has more than 30,000 users. The AlchemyAPI Sentiment Analysis APIs are capable of computing document-level sentiment, user-specified sentiment targeting, entity-level sentiment, emoticon-based sentiment and keyword-level sentiment. AlchemyAPI can be used easily from any major programming language: Java, C/C++, C#, Perl, PHP, Python, Ruby, JavaScript and Android OS. AlchemyAPI uses a REST interface to access the different text algorithms. It can process content as plain text or HTML, and you can use URLs for web-accessible content or raw HTML for private web documents. Most of the functions work with 8 languages: English, German, French, Italian, Portuguese, Russian, Spanish and Swedish. AlchemyAPI is a paid service, but it also offers a free API key with 1000 calls per day to get started.
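A sketch of a document-level sentiment call over the REST interface; the endpoint URL and the apikey/text/outputMode parameter names shown here should be verified against the AlchemyAPI documentation before use.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class AlchemySentiment {

    /**
     * Calls the document-level sentiment endpoint; outputMode=json requests a
     * JSON response. URL and parameter names should be checked against the docs.
     */
    public static String documentSentiment(String apiKey, String text) throws Exception {
        String query = "apikey=" + URLEncoder.encode(apiKey, "UTF-8")
                + "&text=" + URLEncoder.encode(text, "UTF-8")
                + "&outputMode=json";
        URL url = new URL("http://access.alchemyapi.com/calls/text/TextGetTextSentiment?" + query);
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        return json.toString();   // e.g. {"status":"OK","docSentiment":{"type":"negative","score":"-0.42"}}
    }
}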

1.2      SentiWordNet

SentiWordNet arose from the automatic annotation of all synsets in WordNet. Four versions are available: SentiWordNet 1.0, 1.1, 2.0 and 3.0. SentiWordNet 1.0 was based on the bag-of-words concept; SentiWordNet 3.0 is the most widely used. It is freely distributed for noncommercial use, and licenses are available for commercial applications. In SentiWordNet the degree of positivity or negativity ranges from 0 to 1. SentiWordNet was developed by ranking the synsets according to their part of speech (PoS). The parts of speech represented are adjective, noun, adverb and verb, denoted 'a', 'n', 'r' and 'v' respectively. The database has five columns: the part of speech, the offset, the positive score, the negative score, and the synset terms, i.e. all terms belonging to a particular synset. The offset is a numerical ID that, matched with a particular part of speech, identifies a synset. The SentiWordNet lexical database was formulated based on a movie review dataset. The two tables below show the database structure and some sample entries; a small loader sketch follows them.

Field         Description
POS           Part of speech linked with the synset. Four possible values:
              a (adjective), v (verb), n (noun), r (adverb).
Offset        Numerical ID which, combined with the part of speech, uniquely
              identifies a synset in the database.
PosScore      Positive score for this synset; a numerical value ranging from 0 to 1.
NegScore      Negative score for this synset; a numerical value ranging from 0 to 1.
SynsetTerms   List of all terms included in this synset.

Table 2: SentiWordNet database structure

POS   Offset    PosScore   NegScore   SynsetTerms
a     1740      0.125      0          able#1
a     2098      0          0.75       unable#1
n     388959    0          0          divarication#1
n     389043    0          0          fibrillation#2
r     76948     0.625      0          brazenly#1
r     77042     0.125      0.5        brilliantly#2
v     1827745   0          0          slobber_over#1
v     1827858   0.625      0.125      look_up_to#1

Table 3: Sentiment scores associated with SentiWordNet entries
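A minimal sketch of loading the tab-separated SentiWordNet file (columns as in Table 2, plus a trailing gloss column in the distributed file) into a simple score lookup:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class SentiWordNetLoader {

    /**
     * Loads the tab-separated SentiWordNet file into a map from
     * "term#sense_pos" to (PosScore - NegScore). Lines starting with '#'
     * are comments in the distributed file.
     */
    public static Map<String, Double> load(String path) throws Exception {
        Map<String, Double> scores = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("#")) continue;          // skip comment lines
                String[] cols = line.split("\t");
                if (cols.length < 5) continue;                // skip malformed lines
                String pos = cols[0];
                double posScore = Double.parseDouble(cols[2]);
                double negScore = Double.parseDouble(cols[3]);
                for (String term : cols[4].split(" ")) {
                    scores.put(term + "_" + pos, posScore - negScore);   // e.g. "able#1_a" -> 0.125
                }
            }
        }
        return scores;
    }
}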

1.3      Stanford NLP

Stanford NLP is the Java suite of NLP tools developed by Stanford University. It consists of a stack of products including Stanford CoreNLP, the Stanford Parser, the Stanford POS Tagger, the Stanford Named Entity Recognizer, the Stanford Word Segmenter, etc. A movie review dataset was used to train the sentiment model in Stanford NLP. In Stanford NLP the raw text is put into an Annotation object, and a sequence of Annotators then adds information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be output in XML or plain text form. The results of Stanford NLP can be accessed in two ways: the first converts the Annotation object to XML and writes it to a file; the second uses code that pulls a particular type of information out of an Annotation. Stanford NLP can also be accessed easily from many other languages, including Python, Ruby, Perl, Scala, Clojure, JavaScript (Node.js) and .NET.
The execution flow of Stanford NLP consists of the following phases (a minimal sentiment-pipeline sketch follows the list):
  •  Tokenization: the process of chopping a sequence of characters into pieces called tokens.
  •  Sentence Splitting: the ssplit annotator splits a sequence of tokens into sentences.
  •  Part-of-speech Tagging: the pos annotator labels tokens with their POS tags.
  •  Morphological Analysis: the process of providing grammatical information about a word given its suffix. The smallest unit in morphological analysis is the morpheme.
  •  Named Entity Recognition: the ner annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, PERCENT) and temporal (DATE, TIME, DURATION, SET) entities in a given text.
  •  Syntactic Parsing: deals with the grammatical structure of sentences; it consists of identifying phrases and the subject or object of a verb.
  •  Coreference Resolution: coreference means that multiple expressions in a sentence or document refer to the same thing. E.g. in "John drove to Judy's house. He made her dinner.", both "John" and "He" refer to the same entity (John), and "Judy" and "her" refer to the entity (Judy).
  •  Annotators: the backbone of the CoreNLP package is formed by two classes, Annotation and Annotator. Annotations are the data structures which hold the results of Annotators; they are basically maps from keys to bits of the annotation, such as the parse, the part-of-speech tags, or named entity tags. Annotators do things like tokenize, parse, or NER-tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. Stanford CoreNLP inherits from the AnnotationPipeline class and is customized with NLP Annotators.
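A minimal sketch of running the sentiment annotator through this pipeline. The class names follow recent CoreNLP releases; some older releases expose the label as SentimentCoreAnnotations.ClassName instead of SentimentClass.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class CoreNlpSentiment {

    public static void main(String[] args) {
        // The sentiment annotator needs tokenize, ssplit, pos and parse upstream.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
        props.setProperty("parse.binaryTrees", "true");   // sentiment operates on binarized parse trees
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = new Annotation("The delivery was late and the support was unhelpful.");
        pipeline.annotate(annotation);

        // One sentiment label per sentence, from "Very negative" to "Very positive".
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            String label = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(label + "\t" + sentence);
        }
    }
}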

1.4      viralheat API

The viralheat API is used to infer the sentiment of a given piece of text. The free viralheat account can handle 1000 requests per day and accepts only 360 characters per request.


Just wait for more updates in the next post…


To succeed in your mission, you must have single-minded devotion to your goal.