Big Big Things in my Little Little World: More about Sentiment Analysis

Method of Study

A corpus is a collection of documents for analysis. The current evaluation was done on a mail corpus: a set of emails expressing different sentiments, most of them negative opinions. The analysis was carried out in the following steps:

Document Splitting

A mail corpus may contain a single mail or a chain of emails, so the first step is to identify which of the two a given item is. A chain of emails is split into individual mails for analysis.
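The post does not say how a chain is recognised, so the following is only a minimal sketch. It assumes that quoted mails in a chain are separated by a "-----Original Message-----" line, as some mail clients insert; the class name MailChainSplitter and the delimiter pattern are hypothetical and would need adjusting for the real corpus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class MailChainSplitter {
    // Assumed delimiter: a line such as "-----Original Message-----".
    // Real corpora may use other markers (for example "From:" header blocks).
    private static final Pattern DELIMITER =
            Pattern.compile("(?m)^-+\\s*Original Message\\s*-+\\s*$");

    /** Splits a mail chain into individual mails; a single mail yields one element. */
    public static List<String> split(String chain) {
        List<String> mails = new ArrayList<>();
        for (String part : DELIMITER.split(chain)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                mails.add(trimmed);
            }
        }
        return mails;
    }
}
```

A text without the delimiter comes back as a one-element list, so single mails and chains can go through the same code path.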

Sentence splitting

A document is a collection of sentences, so finding the opinion of the writer requires analyzing every sentence in it. Each mail is therefore split into sentences, and each sentence is fed to the sentiment analysis API.
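The post does not name the sentence splitter that was used. As one possible approach, the sketch below relies on java.text.BreakIterator from the Java standard library; the class name SentenceSplitter is made up for illustration, and the English locale is an assumption about the corpus.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    /** Splits a mail body into trimmed, non-empty sentences. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // Assumption: the mails are written in English.
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Each element of the returned list can then be passed to the sentiment analysis API on its own.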

Data transfer to API

To evaluate the open source tools, the appropriate JAR files or sentiment analysis lexicons are downloaded and installed. The commercial APIs, unlike the open source ones, are all available as REST services, and evaluating them requires registering for free API keys. For the commercial APIs, the appropriately formatted text is sent as input to the web-based service.
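For the REST case, a request might be assembled roughly as follows. This is a sketch, not the code used in the study: the endpoint URL and the form parameter names "apikey" and "text" are placeholders, since every commercial API defines its own, and java.net.http requires Java 11 or later.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SentimentRequestBuilder {
    /** Builds a URL-encoded form body; the parameter names are hypothetical. */
    public static String formBody(String apiKey, String sentence) {
        return "apikey=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&text=" + URLEncoder.encode(sentence, StandardCharsets.UTF_8);
    }

    /** Builds (but does not send) a POST request carrying one sentence. */
    public static HttpRequest build(String endpoint, String apiKey, String sentence) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(apiKey, sentence)))
                .build();
    }
}
```

The request would then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the response body handed to the parsing step.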

JSON parsing

All APIs return the result of sentiment analysis in either XML or JSON format, so an appropriate parser is needed to extract the sentiment type and score from these responses. The Jackson JSON parsing API was used to parse the responses.

Extracting the score and type of sentiment

The sentiment type and score are extracted from the JSON response and displayed as output.
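With Jackson, the extraction step could look like the sketch below. The response shape, an object named "sentiment" with "type" and "score" fields, is invented for illustration; each API has its own field names, so the paths would differ per tool. The jackson-databind library must be on the classpath.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SentimentResponseParser {
    /** Simple holder for the two values of interest. */
    public static final class Sentiment {
        public final String type;
        public final double score;

        Sentiment(String type, double score) {
            this.type = type;
            this.score = score;
        }
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Parses a response such as {"sentiment":{"type":"negative","score":-0.62}}. */
    public static Sentiment parse(String json) {
        try {
            // Hypothetical field names; path() returns a "missing" node
            // instead of null, so the defaults below apply when a field is absent.
            JsonNode node = MAPPER.readTree(json).path("sentiment");
            return new Sentiment(node.path("type").asText("unknown"),
                                 node.path("score").asDouble(0.0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Using path() rather than get() keeps the parser tolerant of responses that omit the score, which some APIs do for neutral text.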

Platform used for study

The analysis was done using the Java-based APIs of the different tools.

Key Findings

Longer texts are hard to classify

Lexicon-based classification does not work well on lengthy sentences. Such texts may express strong opinions without containing any subjective words, so a lexicon lookup finds nothing to score.
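A toy example makes the problem concrete. The six-word lexicon and the class name LexiconScorer below are invented for illustration and are far smaller than any real sentiment lexicon; the scoring rule (sum of word polarities) is the simplest possible one, not the method of any evaluated tool.

```java
import java.util.Map;

public class LexiconScorer {
    // Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
    private static final Map<String, Integer> LEXICON = Map.of(
            "good", 1, "great", 1, "happy", 1,
            "bad", -1, "terrible", -1, "angry", -1);

    /** Sums the polarity of every known word; unknown words count as 0. */
    public static int score(String sentence) {
        int total = 0;
        for (String token : sentence.toLowerCase().split("[^a-z]+")) {
            total += LEXICON.getOrDefault(token, 0);
        }
        return total;
    }
}
```

With this lexicon, "The service was terrible" scores -1, while "I have asked three times and nobody has replied to my request" scores 0 even though it is clearly a complaint, because none of its words are in the lexicon.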

Results depend on the training set used in the API

Most sentiment analysis tools are built on specific training corpora. For example, the SentiWordNet sentiment lexicon was formulated based on the movie reviews dataset; hence it performs best on movie reviews.

Future Work

The above approach of evaluation can be further refined by incorporating the following features:

Real time Sentiment Analysis

The current approach analyses all the mails stored in a folder on the machine. This can be modified so that sentiment analysis is done in real time: as soon as a mail enters the inbox, its sentiment is analysed on the fly.
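One way to move from batch to real time, assuming incoming mails still arrive as files in a folder, is to watch that folder with java.nio.file.WatchService. The sketch below is only an illustration of that idea; the class name InboxWatcher is hypothetical, and a real mail server would more likely be reached through a mail protocol than through the file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class InboxWatcher {
    private final Path inbox;
    private final WatchService watcher;

    /** Registers the inbox folder for "file created" notifications. */
    public InboxWatcher(Path inbox) {
        this.inbox = inbox;
        try {
            this.watcher = inbox.getFileSystem().newWatchService();
            inbox.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Waits up to the timeout and returns the paths of newly created entries. */
    public List<Path> nextArrivals(long timeoutSeconds) {
        List<Path> arrivals = new ArrayList<>();
        WatchKey key;
        try {
            key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return arrivals;
        }
        if (key == null) {
            return arrivals; // nothing arrived before the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                arrivals.add(inbox.resolve((Path) event.context()));
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return arrivals;
    }
}
```

A caller would loop over nextArrivals, read each new file, and run it through the same splitting and API steps used for the stored mails.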

Culture based sentiment analysis

This approach extracts semantically hidden concepts from mails and incorporates them into supervised classifier training by interpolation. The interpolation method works by interpolating the original unigram language model in the Naive Bayes classifier with the generative model of words given semantic concepts. Cultural features can be incorporated in a similar way, where the unigram language model is also interpolated with the generative model of users given cultural features.
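The interpolation step can be written down in a few lines. The sketch below assumes a simple linear interpolation with a single weight alpha, which is one common form but not necessarily the exact one intended here; the class name, the parameter names, and the concept labels are all hypothetical.

```java
import java.util.Map;

public class InterpolatedLikelihood {
    /**
     * Assumed form:
     *   P(w|c) = (1 - alpha) * Pu(w|c) + alpha * sum over s of P(w|s) * P(s|c)
     * where Pu is the original unigram model, s ranges over semantic concepts,
     * and alpha in [0, 1] controls how much weight the concept model receives.
     */
    public static double wordGivenClass(double unigramProb,
                                        Map<String, Double> wordGivenConcept,
                                        Map<String, Double> conceptGivenClass,
                                        double alpha) {
        double conceptProb = 0.0;
        for (Map.Entry<String, Double> entry : conceptGivenClass.entrySet()) {
            // Concepts that never generate this word contribute nothing.
            double pWordGivenConcept = wordGivenConcept.getOrDefault(entry.getKey(), 0.0);
            conceptProb += pWordGivenConcept * entry.getValue();
        }
        return (1 - alpha) * unigramProb + alpha * conceptProb;
    }
}
```

With alpha set to 0 this reduces to the plain Naive Bayes unigram likelihood, so the weight can be tuned on held-out data to see whether the concept model helps at all.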

This comparison study focused on detecting the polarity of content, that is, positive or negative sentiment, and does not consider other types of sentiment such as anger or calmness. Only a few of the methods reach a reasonably high level of accuracy. Each of the evaluated tools is trained on a specific corpus, so the results of the analysis depend on the training set used to build the tool's model. Thus, sentiment analysis tools still have a long way to go before reaching the confidence level demanded by practical applications.

Action without knowledge is wastage and Knowledge without action is futile!!

Monday, 31 August 2015
