Sunday, 5 November 2017

Naive Bayes Text classification


Doc -> {+, -}
Each document is treated as a vector (array) of words.
Conditional independence assumption: given the class, the words are assumed to be independent of one another, so word order and co-occurrence are ignored.
The probability of a review being positive is taken to be proportional to P(+) multiplied by P(word | +) for every word in the document.
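
Written as a formula, the assumption gives the rule below. This is a sketch only; the symbols d (document), c (class), w_i (the i-th word) and m (number of words in d) are introduced here for illustration and do not appear in the original notes.

```latex
P(c \mid d) \;\propto\; P(c) \prod_{i=1}^{m} P(w_i \mid c), \qquad c \in \{+, -\}

\hat{c} \;=\; \arg\max_{c \in \{+,-\}} \; P(c) \prod_{i=1}^{m} P(w_i \mid c)
```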

Unique words in the training docs (the vocabulary): I, loved, the, movie, hated, a, great, poor, acting, good [10 unique words]
The method involves 3 steps:
1. Convert docs to feature sets
2. Find probabilities of outcomes
3. Classify new sentences

Convert docs to feature sets

Attributes: all words in the vocabulary
Values: no. of times each word occurs in the doc
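
A minimal sketch of this step in Python, assuming whitespace tokenisation and case-sensitive matching (neither is specified in the notes). The function name to_feature_vector is hypothetical; the vocabulary is the list of 10 unique words given earlier.

```python
from collections import Counter

# The 10 unique words listed in the notes.
VOCABULARY = ["I", "loved", "the", "movie", "hated",
              "a", "great", "poor", "acting", "good"]


def to_feature_vector(doc, vocabulary=VOCABULARY):
    """Return how often each vocabulary word occurs in doc, in vocabulary order."""
    counts = Counter(doc.split())  # assumption: words are separated by spaces
    return [counts[word] for word in vocabulary]


print(to_feature_vector("I hated the poor acting"))
# [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
```

Each position in the list is one attribute (a vocabulary word) and the number stored there is its value (the count in that doc).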

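Find probabilities of outcomes

P(+)=3/5=0.6 (3 of the 5 docs are positive)
No. of words in + cases (n)=14
No. of times word k occurs in + cases = nk
P(wk | +) = (nk + 1) / (n + |vocabulary|)
The +1 in the numerator is add-one (Laplace) smoothing, so a word that never occurs in a class still gets a non-zero probability.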
P(I|+)=(1+1)/(14+10)=0.0833
P(loved|+)=(1+1)/(14+10)=0.0833
P(the|+)=(1+1)/(14+10)=0.0833
P(movie|+)=(4+1)/(14+10)=0.2083
P(hated|+)=(0+1)/(14+10)=0.0417
P(a|+)=(2+1)/(14+10)=0.125
P(great|+)=(2+1)/(14+10)=0.125
P(poor|+)=(0+1)/(14+10)=0.0417
P(acting|+)=(1+1)/(14+10)=0.0833
P(good|+)=(2+1)/(14+10)=0.125
Docs with negative outcomes
P(-)=2/5=0.4
No. of words in - cases (n)=6
P(I|-)=(1+1)/(6+10)=0.125
P(loved|-)=(0+1)/(6+10)=0.0625
P(the|-)=(1+1)/(6+10)=0.125
P(movie|-)=(1+1)/(6+10)=0.125
P(hated|-)=(1+1)/(6+10)=0.125
P(a|-)=(0+1)/(6+10)=0.0625
P(great|-)=(0+1)/(6+10)=0.0625
P(poor|-)=(1+1)/(6+10)=0.125
P(acting|-)=(1+1)/(6+10)=0.125
P(good|-)=(0+1)/(6+10)=0.0625
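
The two tables above can be reproduced with a few lines of Python. This is a sketch under the assumption that the per-class word counts (nk) are exactly the numerators used in the worked example; the names POS_COUNTS, NEG_COUNTS and smoothed_probs are made up for illustration.

```python
# nk for each vocabulary word, read off the worked example above.
POS_COUNTS = {"I": 1, "loved": 1, "the": 1, "movie": 4, "hated": 0,
              "a": 2, "great": 2, "poor": 0, "acting": 1, "good": 2}
NEG_COUNTS = {"I": 1, "loved": 0, "the": 1, "movie": 1, "hated": 1,
              "a": 0, "great": 0, "poor": 1, "acting": 1, "good": 0}


def smoothed_probs(counts):
    """P(wk | class) = (nk + 1) / (n + |vocabulary|) for every word."""
    n = sum(counts.values())   # total words in this class: 14 for +, 6 for -
    vocab_size = len(counts)   # |vocabulary| = 10
    return {word: (nk + 1) / (n + vocab_size) for word, nk in counts.items()}


pos = smoothed_probs(POS_COUNTS)
neg = smoothed_probs(NEG_COUNTS)
print(round(pos["movie"], 4))  # 0.2083
print(round(neg["hated"], 4))  # 0.125
```

Because 1 is added for each of the 10 words and 10 is added to the denominator, the probabilities within each class still sum to 1.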

Classifying a new sentence

Eg: I hated the poor acting
Probability of the sentence being positive:
P(+).P(I|+).P(hated|+).P(the|+).P(poor|+).P(acting|+)
= 0.6*0.0833*0.0417*0.0833*0.0417*0.0833 = 6.0*10^-7
Probability of the sentence being negative:
P(-).P(I|-).P(hated|-).P(the|-).P(poor|-).P(acting|-)
= 0.4*0.125*0.125*0.125*0.125*0.125 = 1.22*10^-5
The negative score is the larger of the two, so the sentence is classified as negative.
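
The whole calculation can be checked with a short Python sketch. It assumes Python 3.8+ (for math.prod), whitespace tokenisation, and that every word of the sentence is in the vocabulary; the names score and classify are hypothetical. The counts and priors are the ones used in the worked example.

```python
from math import prod

VOCAB = ["I", "loved", "the", "movie", "hated",
         "a", "great", "poor", "acting", "good"]
# nk per class in vocabulary order, and the class priors, from the example.
COUNTS = {"+": [1, 1, 1, 4, 0, 2, 2, 0, 1, 2],
          "-": [1, 0, 1, 1, 1, 0, 0, 1, 1, 0]}
PRIORS = {"+": 3 / 5, "-": 2 / 5}


def score(sentence, label):
    """P(label) times the product of smoothed P(word | label) over the words."""
    counts = dict(zip(VOCAB, COUNTS[label]))
    n = sum(counts.values())
    # Raises KeyError for a word outside VOCAB (not handled in this sketch).
    probs = [(counts[w] + 1) / (n + len(VOCAB)) for w in sentence.split()]
    return PRIORS[label] * prod(probs)


def classify(sentence):
    """Return the label with the larger score."""
    return max(PRIORS, key=lambda label: score(sentence, label))


sentence = "I hated the poor acting"
for label in PRIORS:
    print(label, f"{score(sentence, label):.2e}")
print(classify(sentence))
```

Run on the example sentence, the scores come out at about 6.03e-07 for + and 1.22e-05 for -, in line with the hand calculation (the small difference in the first value comes from the rounded probabilities used by hand).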
If a word is not present in the vocabulary, it is assigned a very small probability rather than zero, so that a single unseen word does not force the whole product to 0.
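
The notes do not say which small value is used. One possible choice, shown here purely as an assumption, is to treat an unseen word as having count 0 and apply the same add-one formula. The function name word_prob and the test word "boring" are hypothetical.

```python
def word_prob(word, counts, n, vocab_size):
    """Add-one smoothed P(word | class); an unseen word is treated as count 0."""
    return (counts.get(word, 0) + 1) / (n + vocab_size)


# Non-zero counts for the negative class, from the worked example (n = 6).
neg_counts = {"I": 1, "the": 1, "movie": 1, "hated": 1, "poor": 1, "acting": 1}

print(word_prob("hated", neg_counts, n=6, vocab_size=10))   # 0.125
print(word_prob("boring", neg_counts, n=6, vocab_size=10))  # 0.0625
```

With this choice the per-class probabilities no longer sum exactly to 1 once unseen words are counted; another common option is simply to skip such words.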

