Nowadays, people tend to spend more time on social media platforms such as
Facebook and Twitter. The interactions on social media leave a trail of a huge
amount of data, a major portion of which is in textual form. Opinion mining on
this text is termed sentiment analysis. It involves classifying the text as
Positive, Negative or Neutral based on its polarity. It also includes
determining the attitude of the speaker with respect to the topic. Knowing
whether the trending tweets about a product are positive or negative helps the
company identify areas of improvement.

The usage of social media sites is increasing day by day, so a huge amount of
textual data is generated. This data can be used to analyze the opinion of
customers on social media. Thus the reputation of a product on social media can
be analyzed, and the data can also be used to generate offers based on customer
preferences. This is where sentiment analysis becomes important. It involves
extracting the opinion of the speaker from plain text, which can also be termed
polarity detection. Based on the opinion of the speaker, the text can be
classified as Positive, Negative or Neutral. Different tools, either open source
or commercial, are used for the analysis of customer sentiment. AlchemyAPI,
SentiWordNet, Stanford NLP, viralheat API, Sentimatrix and Python NLTK are some
of them.
Types of Sentiment Analysis
Based on the algorithms used, sentiment analysis can be classified into
different categories. The classification can be based either on the polarity
detection method or on the structure of the text analyzed.
1. Classification based on the polarity detection method
- Supervised: A machine learning technique in which a classifier is trained on a feature set.
- Unsupervised: A sentiment lexicon is used to detect the polarity of the given text.
- Hybrid: A combination of the supervised and unsupervised methods forms the hybrid method.
2. Classification based on the structure of the text
- Document level: It aims to find the sentiment of the whole document.
- Sentence level: The document is split into sentences and the opinion is analysed for each sentence.
- Word level: The opinion mining is done for each word in a sentence.
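The unsupervised (lexicon-based) method and the three levels of analysis can be sketched together in a few lines of Python. This is only a minimal illustration: the `LEXICON` below is a tiny hypothetical word list invented for the example, not a real sentiment lexicon such as SentiWordNet, and the function names are assumptions.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "boring": -1}

def to_label(score):
    """Map a numeric score to a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def words_of(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def word_level(sentence):
    """Polarity of each word in a sentence."""
    return [(w, to_label(LEXICON.get(w, 0))) for w in words_of(sentence)]

def score(text):
    """Sum of the lexicon scores of all words in the text."""
    return sum(LEXICON.get(w, 0) for w in words_of(text))

def sentence_level(document):
    """Polarity of each sentence in a document."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [(s, to_label(score(s))) for s in sentences]

def document_level(document):
    """Polarity of the document as a whole."""
    return to_label(score(document))

review = "The plot was great. The acting was awful. I love the soundtrack."
print(sentence_level(review))
print(document_level(review))  # positive
```

A supervised method would replace the lexicon lookup with a classifier trained on labelled examples, and a hybrid method could use the lexicon scores as additional features for that classifier.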
Steps in Sentiment Analysis
The process of sentiment analysis involves four steps:
1. Pre-processing and breaking the text into parts of speech
This involves the following steps:
POS tagging: The process of assigning parts of speech such as noun, verb and
adjective to each word in a text. It is done based on treebanks. A treebank is
a parsed text corpus that annotates syntactic or semantic sentence structure.
The Penn Treebank is normally used for this purpose. E.g. “This is a sample
sentence” will be output as “This/DT is/VBZ a/DT sample/NN sentence/NN”, where
DT stands for determiner, VBZ for verb, 3rd person singular present, and NN for
noun, singular or mass. The full tag set is listed in Table 1.
| Number | Tag | Description |
|---|---|---|
| 1. | CC | Coordinating conjunction |
| 2. | CD | Cardinal number |
| 3. | DT | Determiner |
| 4. | EX | Existential there |
| 5. | FW | Foreign word |
| 6. | IN | Preposition or subordinating conjunction |
| 7. | JJ | Adjective |
| 8. | JJR | Adjective, comparative |
| 9. | JJS | Adjective, superlative |
| 10. | LS | List item marker |
| 11. | MD | Modal |
| 12. | NN | Noun, singular or mass |
| 13. | NNS | Noun, plural |
| 14. | NNP | Proper noun, singular |
| 15. | NNPS | Proper noun, plural |
| 16. | PDT | Predeterminer |
| 17. | POS | Possessive ending |
| 18. | PRP | Personal pronoun |
| 19. | PRP$ | Possessive pronoun |
| 20. | RB | Adverb |
| 21. | RBR | Adverb, comparative |
| 22. | RBS | Adverb, superlative |
| 23. | RP | Particle |
| 24. | SYM | Symbol |
| 25. | TO | to |
| 26. | UH | Interjection |
| 27. | VB | Verb, base form |
| 28. | VBD | Verb, past tense |
| 29. | VBG | Verb, gerund or present participle |
| 30. | VBN | Verb, past participle |
| 31. | VBP | Verb, non-3rd person singular present |
| 32. | VBZ | Verb, 3rd person singular present |
| 33. | WDT | Wh-determiner |
| 34. | WP | Wh-pronoun |
| 35. | WP$ | Possessive wh-pronoun |
| 36. | WRB | Wh-adverb |

Table 1: Alphabetical list of part-of-speech tags used in the Penn Treebank Project
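To make the `word/TAG` output format concrete, here is a small Python sketch that reads a tagged sentence back into (word, tag) pairs and looks up each tag's description. It does not perform tagging itself (a tagger such as the one in NLTK would produce the tagged string). The helper names are assumptions, and `TAG_DESCRIPTIONS` holds only a few rows of Table 1.

```python
# A few rows of Table 1; extend with the remaining tags as needed.
TAG_DESCRIPTIONS = {
    "DT": "Determiner",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "VBZ": "Verb, 3rd person singular present",
}

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

def describe(pairs):
    """Attach the Table 1 description to every (word, tag) pair."""
    return [(w, t, TAG_DESCRIPTIONS.get(t, "unknown tag")) for w, t in pairs]

tagged = parse_tagged("This/DT is/VBZ a/DT sample/NN sentence/NN")
for word, tag, meaning in describe(tagged):
    print(f"{word:10}{tag:5}{meaning}")
```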
Chunking: The process of identifying phrases is called chunking. E.g. “All the
humans are cardboard cliches in this film” can be divided into two noun chunks:
“All the humans” and “cardboard cliches in this film”.
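A rule-based chunker can be sketched over POS-tagged tokens. The example below is an illustration under assumptions, not a standard algorithm. The tags are assigned by hand. A noun chunk is taken to be optional determiners and adjectives followed by one or more nouns. A preposition followed by another noun chunk is folded into the preceding chunk, which reproduces the two chunks given above.

```python
import re

def tag_class(tag):
    """Collapse a Penn Treebank tag into a one-letter class."""
    if tag in ("DT", "PDT"):
        return "D"          # determiner or predeterminer
    if tag.startswith("JJ"):
        return "J"          # adjective
    if tag.startswith("NN"):
        return "N"          # noun
    if tag == "IN":
        return "P"          # preposition
    return "x"              # anything else breaks a chunk

BASE_NP = "D*J*N+"
# A base noun phrase, optionally extended by preposition + noun phrase.
CHUNK = re.compile(rf"{BASE_NP}(?:P{BASE_NP})*")

def noun_chunks(tagged):
    """Return the noun chunks found in a list of (word, tag) pairs."""
    codes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in CHUNK.finditer(codes)
    ]

# Hand-tagged version of the example sentence.
tagged = [
    ("All", "PDT"), ("the", "DT"), ("humans", "NNS"), ("are", "VBP"),
    ("cardboard", "NN"), ("cliches", "NNS"), ("in", "IN"),
    ("this", "DT"), ("film", "NN"),
]
print(noun_chunks(tagged))
# ['All the humans', 'cardboard cliches in this film']
```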
Named Entity Recognition: It involves finding named entities such as names of
persons, organizations and locations, expressions of time, quantities, monetary
values, percentages, etc. E.g. “Jim bought 300 shares of Acme Corp. in 2006”
can be annotated as “[Jim] Person bought 300 shares of [Acme Corp.] Organization
in [2006] Time”.
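The annotation format above can be reproduced with a very simple lookup-based sketch. Real named entity recognizers rely on trained models. The `GAZETTEER` and the year pattern below are hypothetical and only cover this one example.

```python
import re

# Hypothetical list of known entities and their types.
GAZETTEER = {"Jim": "Person", "Acme Corp.": "Organization"}
# Four-digit years starting with 19 or 20 are treated as time expressions.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def annotate(text):
    """Wrap known entities and years in [brackets] followed by their type."""
    for name, label in GAZETTEER.items():
        text = text.replace(name, f"[{name}] {label}")
    return YEAR.sub(lambda m: f"[{m.group(0)}] Time", text)

print(annotate("Jim bought 300 shares of Acme Corp. in 2006"))
# [Jim] Person bought 300 shares of [Acme Corp.] Organization in [2006] Time
```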
2. Subjective/Objective classification
The sentences in a document are classified into subjective and objective
sentences. Fact-based sentences are called objective sentences. E.g. “47% of
Americans pay no federal income tax”. Subjective sentences consist of personal
opinions, interpretations, points of view, etc. E.g. “Spanish is difficult”.
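One crude way to approximate this step is to flag a sentence as subjective when it contains an opinion-bearing cue word. The sketch below assumes a small hypothetical cue list. Practical systems use a subjectivity lexicon or a trained classifier instead.

```python
import re

# Hypothetical cue words that signal a personal opinion.
SUBJECTIVE_CUES = {"difficult", "easy", "beautiful", "awful", "good", "bad", "think", "feel"}

def is_subjective(sentence):
    """Return True if the sentence contains at least one cue word."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & SUBJECTIVE_CUES)

print(is_subjective("Spanish is difficult"))                        # True
print(is_subjective("47% of Americans pay no federal income tax"))  # False
```

Only the sentences flagged as subjective would be passed on to polarity classification.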
3. Polarity classification
The subjective sentences are classified as positive, negative or neutral based
on the opinion of the speaker in the sentence. Calculating the polarity of each
sentence is very important for determining the overall sentiment. An SVM
classifier with different features is mainly used for this purpose. The
features include:
Bag of Words: Here a text is represented as a bag (multiset) of words,
disregarding grammar while preserving multiplicity. E.g. consider the following
two text documents:
John likes to watch movies. Mary likes movies too.
John also likes to watch football games.
Based on these two text documents, a dictionary is constructed as:
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
The dictionary has 10 distinct words. Using the indexes of the dictionary, each
document is represented by a 10-entry vector:
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each entry of a vector is the count of the corresponding dictionary word in the
document. The first vector represents document 1 and the second vector
represents document 2. In the first vector the first two entries are "1, 2".
The first entry corresponds to the word "John", which is the first word in the
dictionary, and its value is "1" because "John" appears once in the first
document. Similarly, the second entry corresponds to the word "likes", which is
the second word in the dictionary, and its value is "2" because "likes" appears
twice in the first document.
Negation Handling: Negation plays an important role in polarity analysis. E.g.
“This is not a good movie” has the opposite polarity from the sentence “This is
a good movie”, although the bag-of-words features alone would suggest that both
have the same polarity. To treat the word “good” differently in the two
sentences, polarity information is attached to the word itself, so the negated
sentence is interpreted as expressing a negative opinion.
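Both features can be sketched in a few lines of Python. The bag-of-words part reproduces the two vectors from the John/Mary example, using the dictionary order given in the text. For negation, the sketch uses one common convention, which is an assumption here rather than something the text prescribes: every word after a negation word is prefixed with `NOT_` until the next punctuation mark.

```python
import re
from collections import Counter

VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]
NEGATIONS = {"not", "no", "never"}
PUNCTUATION = {".", ",", "!", "?", ";"}

def tokenize(text):
    """Split text into words and punctuation marks (case is preserved)."""
    return re.findall(r"[A-Za-z']+|[.,!?;]", text)

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def mark_negation(tokens):
    """Prefix words that follow a negation with NOT_, up to the next punctuation."""
    marked, negated = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False
            marked.append(token)
        elif token.lower() in NEGATIONS:
            negated = True
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."
print(bag_of_words(tokenize(doc1), VOCABULARY))  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(bag_of_words(tokenize(doc2), VOCABULARY))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

print(mark_negation(tokenize("This is not a good movie")))
# ['This', 'is', 'not', 'NOT_a', 'NOT_good', 'NOT_movie']
```

After marking, "good" and "NOT_good" are different dictionary entries, so the two example sentences no longer share the same feature for "good".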
4. Sentiment Aggregation
The main task of this step is to compute the overall sentiment of the document
by aggregating the sentences that were tagged positive or negative during
polarity classification.
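The text does not fix a particular aggregation rule. A simple possibility, shown here as an assumption, is a majority vote over the sentence labels, with ties reported as neutral.

```python
from collections import Counter

def aggregate_sentiment(sentence_labels):
    """Combine per-sentence labels into one document label by majority vote."""
    counts = Counter(sentence_labels)
    balance = counts["positive"] - counts["negative"]
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"

print(aggregate_sentiment(["positive", "negative", "positive"]))  # positive
```

Weighted schemes, for example giving more weight to the first or last sentence of a review, are also possible.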
Applications of Sentiment Analysis
- Product recommendations: Opinion mining can be used to provide recommendations to customers based on word of mouth.
- Reputation analysis in social media: The public opinion regarding a product or service can be analyzed by text mining.
Challenges in Sentiment Analysis
- Named Entity Recognition: In some sentences it is difficult to find the topic the author is speaking about. E.g. is “300 Spartans” a group of Greeks or a movie?
- Anaphora Resolution: The problem of resolving what a pronoun or a noun phrase refers to. E.g. "We watched the movie and went to dinner; it was awful." What does "it" refer to?
- Parsing: Finding the subject and object of the sentence, and which of them the verb and/or adjective actually refers to.
- Sarcasm: If we don't know the author, we have no idea whether 'bad' means bad or good.
- Texts from social media sites: Ungrammatical sentences, abbreviations, missing capitalization, poor spelling and poor punctuation occur commonly in social media posts.
- Detecting in-depth sentiment/emotion: Positive versus negative is a very simple analysis. One of the challenges is extracting emotions, such as how much hate, happiness or sadness there is in the opinion.
- Finding the object for which the opinion is expressed: For example, "She beat him!" expresses a positive sentiment for her and a negative sentiment for him at the same time.
- Analysis of very subjective sentences or paragraphs: Sometimes even humans find it very hard to agree on the sentiment of highly subjective texts. Imagine how hard it is for a computer.