Wednesday, 5 August 2015

Tools used for Sentiment Analysis

1.1      AlchemyAPI

AlchemyAPI combines linguistic and statistical analysis. It was formulated based on tweets. The linguistic analysis consists of identifying phrases and how these phrases combine to form sentences. The statistical analysis consists of applying mathematical techniques to text. AlchemyAPI has more than 30,000 users. Its Sentiment Analysis APIs can compute document-level sentiment, user-specified targeted sentiment, entity-level sentiment, emoticon sentiment and keyword-level sentiment. AlchemyAPI can be used from any major programming language or platform: Java, C/C++, C#, Perl, PHP, Python, Ruby, JavaScript and Android. It exposes its text analysis algorithms through a REST interface. It can process content as plain text or HTML, and you can use URLs for web-accessible content or raw HTML for private web documents. Most of the functions work with 8 languages: English, German, French, Italian, Portuguese, Russian, Spanish and Swedish. AlchemyAPI is a paid service, but it also offers a free API key to get started with 1000 calls per day.

1.2      SentiWordNet

The automatic annotation of all synsets in WordNet gave rise to SentiWordNet. Four versions of SentiWordNet are available: SentiWordNet 1.0, SentiWordNet 1.1, SentiWordNet 2.0 and SentiWordNet 3.0. SentiWordNet 1.0 was based on the concept of bag of words. SentiWordNet 3.0 is widely used. It is freely distributed for noncommercial use, and licenses are available for commercial applications. In SentiWordNet the degree of positivity or negativity ranges from 0 to 1. SentiWordNet was developed by ranking the synsets according to the PoS (part of speech). The parts of speech represented in SentiWordNet are adjective, noun, adverb and verb, which are represented respectively as 'a', 'n', 'r' and 'v'. The database has five columns: the part of speech, the offset, the positive score, the negative score and the synset terms, which include all terms belonging to a particular synset. The offset is a numerical ID that, when matched with a particular part of speech, identifies a synset. The SentiWordNet lexical database was formulated based on the movie review dataset.

Field | Description
POS | Part of speech linked with the synset. This can take four possible values: a (adjective), v (verb), n (noun), r (adverb).
Offset | Numerical ID which, together with the part of speech, uniquely identifies a synset in the database.
PosScore | Positive score for this synset. This is a numerical value ranging from 0 to 1.
NegScore | Negative score for this synset. This is a numerical value ranging from 0 to 1.
SynsetTerms | List of all terms included in this synset.
Table 2: SentiWordNet database structure

POS | Offset | PosScore | NegScore | SynsetTerms
a | 1740 | 0.125 | 0 | able#1
a | 2098 | 0 | 0.75 | unable#1
n | 388959 | 0 | 0 | divarication#1
n | 389043 | 0 | 0 | fibrillation#2
r | 76948 | 0.625 | 0 | brazenly#1
r | 77042 | 0.125 | 0.5 | brilliantly#2
v | 1827745 | 0 | 0 | slobber_over#1
v | 1827858 | 0.625 | 0.125 | look_up_to#1
Table 3: Sentiment scores associated to SentiWordNet entries
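To make the column layout concrete, here is a minimal Python sketch that parses a few rows in the five-column form described above. The whitespace-separated sample rows, the SentiSynset class, the helper names and the "net score" (PosScore minus NegScore) are my own assumptions for illustration; they are not part of the SentiWordNet distribution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentiSynset:
    pos: str            # 'a', 'n', 'r' or 'v'
    offset: int         # numerical ID; unique together with pos
    pos_score: float    # positivity, 0 to 1
    neg_score: float    # negativity, 0 to 1
    terms: List[str]    # entries such as "able#1" (term#sense number)


# A few rows taken from Table 3, written one synset per line (assumed layout).
SAMPLE_ROWS = """\
a 1740 0.125 0 able#1
a 2098 0 0.75 unable#1
r 77042 0.125 0.5 brilliantly#2
v 1827858 0.625 0.125 look_up_to#1
"""


def parse_row(line):
    """Split one row into its five fields."""
    pos, offset, pos_score, neg_score, *terms = line.split()
    return SentiSynset(pos, int(offset), float(pos_score), float(neg_score), terms)


def build_lexicon(text):
    """Map (word, pos, sense number) to a net score (PosScore - NegScore)."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = parse_row(line)
        for term in entry.terms:
            word, _, sense = term.partition("#")
            lexicon[(word.lower(), entry.pos, int(sense))] = entry.pos_score - entry.neg_score
    return lexicon


lexicon = build_lexicon(SAMPLE_ROWS)
print(lexicon[("unable", "a", 1)])
```

Collapsing the two scores into one number is only one possible simplification; keeping PosScore and NegScore separate preserves more information.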

1.3      Stanford NLP

Stanford NLP is the Java suite of NLP tools developed by Stanford University. It consists of a stack of products including Stanford CoreNLP, Stanford Parser, Stanford POS Tagger, Stanford Named Entity Recognizer, Stanford Word Segmenter etc. A movie review dataset was used for training its sentiment model. In Stanford NLP the raw text is put into an Annotation object and then a sequence of Annotators adds information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be output in XML or plain text form. The results of Stanford NLP can be accessed in two ways: the first method converts the Annotation object to XML and writes it to a file. The second method involves writing code that gets a particular type of information out of an Annotation. Stanford NLP can be accessed easily from many languages, including Python, Ruby, Perl, Scala, Clojure, JavaScript (node.js), and .NET.
The execution flow of Stanford NLP consists of the following phases:
  • Tokenization: It is the process of chopping a sequence of characters into pieces called tokens.
  • Sentence Splitting: The “ssplit” property splits a sequence of tokens into sentences.
  • Part-of-speech Tagging: The “pos” property labels tokens with their POS tags.
  • Morphological Analysis: Morphological analysis is the process of providing grammatical information about a word given its suffix. The smallest unit in morphological analysis is the morpheme.
  • Named Entity Recognition: The “ner” property recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities in a given text.
  • Syntactic Parsing: It mainly deals with the grammatical structure of sentences. It consists of identifying phrases and the subject or object of a verb.
  • Coreference Resolution: Coreference means that multiple expressions in a sentence or document refer to the same thing. E.g. consider the sentences “John drove to Judy’s house. He made her dinner.” In this example both “John” and “He” refer to the same entity (John), and “Judy” and “her” refer to the same entity (Judy).
  • Annotators: The backbone of the CoreNLP package is formed by two classes: Annotation and Annotator. Annotations are the data structures which hold the results of Annotators. Annotations are basically maps, from keys to bits of the annotation, such as the parse, the part-of-speech tags, or named entity tags. Annotators tokenize, parse, or NER tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. Stanford CoreNLP inherits from the AnnotationPipeline class, and is customized with NLP Annotators.
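The Annotation/Annotator/pipeline idea can be illustrated with a toy sketch in Python. To be clear, this is not the CoreNLP API: the classes, the regular expression and the two annotators below are simplified stand-ins I wrote to show how a map-like Annotation is filled in step by step.

```python
import re


class Annotation(dict):
    """A map from annotation keys to values; starts with only the raw text."""

    def __init__(self, text):
        super().__init__(text=text)


def tokenize(annotation):
    # Words (optionally with an apostrophe suffix) or single punctuation marks.
    annotation["tokens"] = re.findall(r"\w+(?:'\w+)?|[^\w\s]", annotation["text"])


def ssplit(annotation):
    # Naive sentence splitter: close a sentence at '.', '!' or '?'.
    sentences, current = [], []
    for token in annotation["tokens"]:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    annotation["sentences"] = sentences


class AnnotationPipeline:
    """Runs a fixed sequence of annotators over one Annotation."""

    def __init__(self, annotators):
        self.annotators = list(annotators)

    def annotate(self, text):
        annotation = Annotation(text)
        for annotator in self.annotators:
            annotator(annotation)  # each step adds its own keys
        return annotation


pipeline = AnnotationPipeline([tokenize, ssplit])
result = pipeline.annotate("John drove to Judy's house. He made her dinner.")
print(result["sentences"])
```

The order of annotators matters: ssplit reads the "tokens" key, so it can only run after tokenize, which mirrors the phase ordering listed above.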

1.4      viralheat API

viralheat API is used to infer the sentiment of a given piece of text. The free account of viralheat API can handle 1000 requests per day and accepts only 360 characters per request.


Just wait for more updates in the next post…


To succeed in your mission, you must have single-minded devotion to your goal.

Saturday, 18 July 2015

Sentiment Analysis

Nowadays, people tend to spend more time on social media platforms such as Facebook and Twitter. The interactions on social media leave a trail of a huge amount of data, a major portion of which is in textual form. The art of opinion mining is termed sentiment analysis. It involves extracting the opinion of the speaker from plain text and classifying the text as Positive, Negative or Neutral based on its polarity, which is why it is also termed polarity detection. It also includes determining the attitude of the speaker with respect to the topic. Knowing whether the trending tweets about a product are positive or negative helps a company identify areas of improvement.

The usage of social media sites is increasing day by day, so the amount of textual data keeps growing. This data can be used to analyze the opinion of customers on social media. Thus the reputation of a product on social media can be analyzed, and the results can also be used to generate offers based on customer preferences. Here comes the importance of sentiment analysis. Sentiment analysis is done using different tools, which can be either open source or commercial. AlchemyAPI, SentiWordNet, Stanford NLP, viralheat API, Sentimatrix and Python NLTK are some of them.

Types of Sentiment Analysis

Based on the algorithms used, sentiment analysis can be classified into different categories. The classification can be based either on the polarity detection method or on the structure of the text analyzed.
1. Classification based on the polarity detection method
  • Supervised- It is a machine learning technique in which a classifier is trained based on a feature set.
  • Unsupervised- In the unsupervised method a sentiment lexicon is used to detect the polarity of the given text.
  • Hybrid- A combination of supervised and unsupervised methods forms the hybrid method.

2. Classification based on the structure of the text
  • Document level- It aims to find the sentiment for the whole document.
  • Sentence level- Here a document is split into sentences and the opinion is analysed for each sentence.
  • Word level- The opinion mining is done for each word in a sentence.
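As a rough illustration of the unsupervised (lexicon-based) method, the Python sketch below sums word scores from a tiny sentiment lexicon. The six-word lexicon and the function name are hypothetical; a real system would use a resource such as SentiWordNet instead.

```python
import re

# Hypothetical mini-lexicon: word -> polarity score (+1 positive, -1 negative).
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def lexicon_polarity(text):
    """Classify text by summing the lexicon scores of its words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


print(lexicon_polarity("The plot was great and I love the cast"))
```

No training data is needed here, which is the main difference from the supervised method, where a classifier learns the word weights from labelled examples.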

Steps in Sentiment Analysis

The process of sentiment analysis involves four steps, namely:
1.    Pre-processing and breaking the text into parts of speech
This involves the following steps:
POS tagging: It is the process of assigning parts of speech such as noun, verb and adjective to each word in a text. It is done based on treebanks. A treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The Penn Treebank is normally used for this purpose. E.g. “This is a sample sentence” will be output as “This/DT is/VBZ a/DT sample/NN sentence/NN”, where DT stands for determiner, VBZ for verb, 3rd person singular present, and NN for singular noun.

Number | Tag | Description
1. | CC | Coordinating conjunction
2. | CD | Cardinal number
3. | DT | Determiner
4. | EX | Existential there
5. | FW | Foreign word
6. | IN | Preposition or subordinating conjunction
7. | JJ | Adjective
8. | JJR | Adjective, comparative
9. | JJS | Adjective, superlative
10. | LS | List item marker
11. | MD | Modal
12. | NN | Noun, singular or mass
13. | NNS | Noun, plural
14. | NNP | Proper noun, singular
15. | NNPS | Proper noun, plural
16. | PDT | Predeterminer
17. | POS | Possessive ending
18. | PRP | Personal pronoun
19. | PRP$ | Possessive pronoun
20. | RB | Adverb
21. | RBR | Adverb, comparative
22. | RBS | Adverb, superlative
23. | RP | Particle
24. | SYM | Symbol
25. | TO | to
26. | UH | Interjection
27. | VB | Verb, base form
28. | VBD | Verb, past tense
29. | VBG | Verb, gerund or present participle
30. | VBN | Verb, past participle
31. | VBP | Verb, non-3rd person singular present
32. | VBZ | Verb, 3rd person singular present
33. | WDT | Wh-determiner
34. | WP | Wh-pronoun
35. | WP$ | Possessive wh-pronoun
36. | WRB | Wh-adverb
Table 1: Alphabetical list of part-of-speech tags used in the Penn Treebank Project
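The word/TAG output format from the POS tagging example can be reproduced with a toy lookup tagger in Python. This is only a sketch: the five-entry tag lexicon is hand-written for this one sentence, and falling back to NN for unknown words is an assumption of mine. Real taggers are trained on a treebank.

```python
# Hypothetical hand-written lexicon mapping words to Penn Treebank tags.
TAG_LEXICON = {"this": "DT", "is": "VBZ", "a": "DT", "sample": "NN", "sentence": "NN"}


def tag(sentence, default="NN"):
    """Look each word up in the lexicon; unknown words get the default tag."""
    return [(word, TAG_LEXICON.get(word.lower(), default)) for word in sentence.split()]


def format_tagged(tagged):
    """Render (word, tag) pairs in the usual word/TAG notation."""
    return " ".join(f"{word}/{tag_name}" for word, tag_name in tagged)


print(format_tagged(tag("This is a sample sentence")))
```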
Chunking: The process of identifying phrases is called chunking. E.g. “All the humans are cardboard cliches in this film”. This can be divided into two noun chunks: “All the humans” and “cardboard cliches in this film”.
Named Entity Recognition: It involves finding named entities such as names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc. E.g. “Jim bought 300 shares of Acme Corp. in 2006”. This can be parsed as “[Jim] Person bought 300 shares of [Acme Corp.] Organization in [2006] Time”.
2.    Subjective/Objective classification
The different sentences in a document are classified into subjective and objective sentences. Fact-based sentences are called objective sentences. E.g. “47% of Americans pay no federal income tax”. Subjective sentences consist of personal opinions, interpretations, points of view etc. E.g. “Spanish is difficult”.
3.    Polarity classification
The subjective sentences are classified as positive, negative or neutral based on the opinion of the speaker in the sentence. Calculating the polarity of each sentence is very important for determining the overall sentiment. An SVM classifier is mainly used for this purpose, with features that include:
Bag of Words: Here a text is represented as a bag or multiset of words, disregarding the grammar while preserving the multiplicity. E.g. consider the following two text documents:
John likes to watch movies. Mary likes movies too.
John also likes to watch football games.
Based on these two text documents, a dictionary is constructed as:
{
"John": 1,
"likes": 2,
"to": 3,
"watch": 4,
"movies": 5,
"also": 6,
"football": 7,
"games": 8,
"Mary": 9,
"too": 10
}
The dictionary has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Here each entry of a vector is the count of the corresponding dictionary entry in the document. The first vector represents document 1 while the second vector represents document 2. In the first vector the first two entries are "1, 2". The first entry corresponds to the word "John", which is the first word in the dictionary, and its value is "1" because "John" appears in the first document 1 time. Similarly, the second entry corresponds to the word "likes", which is the second word in the dictionary, and its value is "2" because "likes" appears in the first document 2 times.
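The two vectors can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary is copied from the example, while the function name and the simple word-matching regular expression are my own choices.

```python
import re
from collections import Counter

# Dictionary from the example: word -> 1-based index.
VOCABULARY = {
    "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
    "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10,
}


def vectorize(document, vocabulary=VOCABULARY):
    """Count how often each dictionary word occurs in the document."""
    counts = Counter(re.findall(r"\w+", document))
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index - 1] = counts[word]
    return vector


doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."
print(vectorize(doc1))  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(vectorize(doc2))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```

Word order is lost in this representation: any reordering of the words in a document produces the same vector.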
Negation Handling: Negation plays an important role in polarity analysis. E.g. “This is not a good movie” has the opposite polarity from the sentence “This is a good movie”, although the features of the original model would show that they have the same polarity. So, in order to handle the word “good” differently in the two sentences, polarity information is added to the word when it follows a negation. Hence the first sentence is interpreted as expressing a negative opinion.
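One common way to add that polarity information is to prefix every word after a negation with a marker until the next punctuation mark, so that “good” and “NOT_good” become different features. The Python sketch below assumes this marking scheme; the negation word list and the NOT_ prefix are my choices, not something fixed by a particular tool.

```python
NEGATIONS = {"not", "no", "never"}
PUNCTUATION = {".", ",", ";", "!", "?"}


def mark_negation(tokens):
    """Prefix tokens that follow a negation word with NOT_ until punctuation."""
    marked, negated = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False          # negation scope ends at punctuation
            marked.append(token)
        elif token.lower() in NEGATIONS:
            negated = True           # start of a negated span
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked


print(mark_negation("This is not a good movie".split()))
```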
4.    Sentiment Aggregation
The main task of this step is to aggregate the overall sentiment of the document from the sentences which were tagged positive and negative in polarity classification.
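A very simple way to aggregate is a majority vote over the sentence labels, sketched below in Python. Majority voting is an assumption on my part; weighted sums of sentence scores are another option.

```python
from collections import Counter


def aggregate(sentence_labels):
    """Return the document label by majority vote over sentence labels."""
    counts = Counter(sentence_labels)
    if counts["Positive"] > counts["Negative"]:
        return "Positive"
    if counts["Negative"] > counts["Positive"]:
        return "Negative"
    return "Neutral"  # tie, or no opinionated sentences at all


print(aggregate(["Positive", "Negative", "Positive", "Neutral"]))
```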

Applications of Sentiment Analysis

  • Product recommendations: Opinion mining can be used to provide recommendations to customers based on word of mouth.
  • Reputation analysis in social media: The public opinion regarding a product or service can be analyzed by text mining.

Challenges in Sentiment Analysis


o    Named Entity Recognition - In some sentences it is difficult to find the topic the author is speaking about. E.g. Is "300 Spartans" a group of Greeks or a movie?
o    Anaphora Resolution - It is the problem of resolving what a pronoun or a noun phrase refers to. E.g. "We watched the movie and went to dinner; it was awful." What does "it" refer to?
o    Parsing - This deals with finding the subject and object of the sentence, and which of them the verb and/or adjective actually refers to.
o    Sarcasm - If we don't know the author, we have no idea whether 'bad' means bad or good.
o    Texts from social media sites - Ungrammatical sentences, abbreviations, lack of capitals, poor spelling and poor punctuation occur commonly in social media posts.
o    Detecting in-depth sentiment/emotion - Positive and negative is a very simple analysis. One of the challenges is to extract emotions, such as how much hate, happiness or sadness there is inside the opinion.
o    Finding the object for which the opinion is expressed - For example, "She won him!" expresses a positive sentiment for her and a negative sentiment for him at the same time.
o    Analysis of very subjective sentences or paragraphs - Sometimes even humans find it very hard to agree on the sentiment of highly subjective texts. Imagine how hard it is for a computer.



Difficult Roads Lead To Beautiful Destinations!!!!!

Tuesday, 14 July 2015

Natural Language Processing


It's been a long time since I wrote a post. So here comes anjusthoughts with a Bang... This post is inspired by one of my colleagues

As we all know, language is a means of communication. Languages can be broadly classified into two types, namely:
  • Natural languages are the languages that people speak, such as English, Spanish, and French. These languages were not designed; they evolved naturally.
  • Formal languages are languages that are designed by people for specific applications.


Natural Language Processing




Natural Language Processing or NLP consists of a set of tasks that computers perform to understand and generate natural language. The computer is used for the interpretation and analysis of natural language.
Natural Language Generation (NLG)
NLG is when a computer writes text of the same quality as that of a human being. It can also be termed Text Generation.
Natural Language Understanding (NLU)
NLU attempts to understand the meaning behind a written text. NLU faces the challenge of understanding a text without ambiguity, while understanding the rules of the language used. So two issues must be addressed:
  • What to say- What we are going to talk about
  • How to say- It deals with formulating grammatically correct sentences.

Stages of Natural Language Processing

Natural Language Processing can be divided into three stages namely:
  1. Syntactic Analysis
  2. Semantic Analysis
  3. Contextual Representation
Now let’s look into each of these stages in detail:
  1. Syntactic Analysis
In this phase the input is checked to ensure that its syntax is correct. This is done based on a grammar. The following two simple concepts are used:
Context-Free Grammars (CFG)
Consider the following sentence:
“The cat eats rice.”
The parse tree for the above sentence is as follows:

The rules for the construction of the tree are:
S -> NP VP
NP -> DET N | DET ADJ N
VP -> V NP
The lexical rules include:
DET -> the
ADJ -> big | fat
Top-Down Parser
The parser starts with the symbol S and attempts to rewrite it, rule by rule, into a sequence of terminals that matches the sentence. Each rule of a CFG consists of:
  • LHS- A single non-terminal symbol, which can be expanded further.
  • RHS- A sequence of terminals and/or non-terminals. Terminals cannot be expanded further.
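A top-down parse with these rules can be sketched as a small backtracking parser in Python. The grammar and the DET and ADJ entries come from the rules above; the N and V entries, the tuple-based tree format and the function names are assumptions I added so that the example runs.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "DET": {"the"},
    "ADJ": {"big", "fat"},
    "N": {"cat", "rice"},   # assumed entries, not given in the rules above
    "V": {"eats"},          # assumed entry, not given in the rules above
}


def parse(symbol, tokens, start):
    """Yield (tree, next_index) for each way symbol can match tokens[start:]."""
    if symbol in LEXICON:  # pre-terminal: must match exactly one word
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for production in GRAMMAR[symbol]:  # try each right-hand side in turn
        yield from expand(symbol, production, tokens, start)


def expand(symbol, production, tokens, start):
    """Match the symbols of one production left to right, keeping all options."""
    partials = [([], start)]
    for child in production:
        partials = [
            (children + [tree], nxt)
            for children, pos in partials
            for tree, nxt in parse(child, tokens, pos)
        ]
    for children, pos in partials:
        yield (symbol, *children), pos


def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, else None."""
    tokens = sentence.lower().rstrip(".").split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("The big cat eats the rice."))
```

With exactly these rules, “The cat eats rice.” is rejected because “rice” has no determiner; accepting it would need an extra rule such as NP -> N.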
  2. Semantic Analysis
It involves the formulation of a logical representation of the sentence. The meaning of the sentence must be extracted for such a representation.
  3. Contextual Representation
As its name implies, the sentence is analysed based on the context. The logical representation is converted into a knowledge representation.

More updates about Natural Language Processing in the Next Post....


I am Thankful to all those who said NO. Because of them I did it myself.


Friday, 26 June 2015

Pentaho Data Integration

Pentaho Data Integration, or Kettle, consists of a core data integration (ETL) engine and GUI applications that allow the user to define data integration jobs and transformations.
The name Kettle evolved from "KDE ETTL Environment" to "Kettle ETTL Environment" after the plan of developing the software on top of KDE (K Desktop Environment) was dropped. This tool was open sourced in December 2005 and acquired by Pentaho early in 2006. Matt Casters is the lead developer of Kettle.
ETTL stands for:
  • Data extraction from source databases
  • Transport of the data
  • Data transformation
  • Loading of data into a data warehouse

Kettle

Kettle is a set of tools and applications that allows data manipulation across multiple sources. The main components of Pentaho Data Integration are:
  • Spoon - a graphical tool that makes the transformations of an ETTL process easy to design.
  • Pan - an application dedicated to running data transformations designed in Spoon.
  • Chef - a tool to create jobs which automate the database update process.
  • Kitchen - an application which helps execute jobs in batch mode, usually on a schedule, which makes it easy to start and control the ETL processing.
  • Carte - a web server which allows remote monitoring of running Pentaho Data Integration ETL processes through a web browser.

Downloading Pentaho Data Integration


Steps for installation (in Windows)

  1. Unzip the downloaded folder.
  2. A folder named data-integration is created. In this folder, open the spoon.bat file (just double-click it).

Steps for installation (in Linux)

    Run the spoon.sh file.

Connecting to Progress database using Pentaho

  1. Add the jars base.jar, openedge.jar, pool.jar, spy.jar and util.jar to the \Pentaho-Kettle\data-integration\libext\JDBC folder.
  2. Double-click Transformations in the View tab. A new transformation is created.
  3. To change the name of the transformation, right-click the newly created transformation, i.e. “Transformation 1”. Select Settings and edit the transformation name.
  4. To connect to the Progress database, right-click Database connections and select New.
  5. Select General and add a connection name (e.g. test). Select Generic database as the connection type.
  6. Custom connection URL: jdbc:datadirect:openedge://hostName:50590;databaseName=dbName;defaultSchema=PUB
  7. Custom driver class name: com.ddtek.jdbc.openedge.OpenEdgeDriver
  8. Username: userName
  9. Password: passWord
  10. Click Test. If the settings are correct, a message reports that the connection was successful. Click OK.

Exporting tables as csv files

  1. Change to the Design view tab.
  2. Select Table input under Input (drag and drop Table input onto the Transformation 1 window).
  3. Select Text file output under Output (drag and drop Text file output onto the Transformation 1 window).
  4. Right-click the Text file output icon in the window and select “Edit step”. In the window that opens, we can either specify a file name or browse for a location to save the file in the “File” tab. In the “Content” tab we can specify the separator, i.e. to export as a csv file, specify the separator as , (comma). Click OK.
  5. Click Table input, hold Shift along with the left mouse button and drag to Text file output. Thus a hop is created.
  6. Right-click Table input and select Edit. Specify the SQL query to be executed. Click OK.
  7. Click Run (the green triangle). We can see the execution.
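For readers who want to see what this transformation does in plain code, here is a rough Python equivalent of the Table input -> Text file output flow. It is not Kettle itself: an in-memory SQLite database stands in for the Progress database, and the customer table, its rows and the function name are made up for the example.

```python
import csv
import io
import sqlite3


def export_query_to_csv(connection, query, out):
    """Run the SQL query (Table input) and write the rows as CSV (Text file output)."""
    cursor = connection.execute(query)
    writer = csv.writer(out, delimiter=",")  # the separator set in the Content tab
    writer.writerow([column[0] for column in cursor.description])  # header row
    writer.writerows(cursor)


# Stand-in database with a made-up table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
connection.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

buffer = io.StringIO()  # use open("customer.csv", "w", newline="") to write a real file
export_query_to_csv(connection, "SELECT id, name FROM customer ORDER BY id", buffer)
print(buffer.getvalue())
```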
The importance of water is not known until the stream runs dry!!