In this project we will build a sentiment analyzer that judges the sentiment of movie reviews.
The dataset is taken from the Rotten Tomatoes website. We will be dealing with natural language processing in Python and will be using several Python modules for NLP. This notebook focuses on cleaning the data, since we are dealing with raw reviews. Let's start.
Let's import our libraries.
We have only one data file, train.tsv. Now, .tsv (tab-separated values) is another data file extension, just like .xls or .csv. Pandas has a function, read_table, which allows us to read these files.
Let's see what we have in this file.
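A minimal loading sketch, assuming train.tsv sits in the working directory (pd.read_csv with sep='\t' works just as well):

```python
import pandas as pd

# read_table uses tab as the default separator, so it handles .tsv files directly
data = pd.read_table('train.tsv')
data.head()
```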
| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
The Phrase column contains the text of the reviews. The reviews have been parsed by the Stanford parser and split into phrases; each phrase has a SentenceId and a PhraseId.
Now, if we look at the Sentiment column, we have multiclass labels: the sentiments take the numeric classes 0, 1, 2, 3, and 4. The dataset source states what each numeric label means:

- 0: Negative
- 1: Somewhat Negative
- 2: Neutral
- 3: Somewhat Positive
- 4: Positive
For visualization purposes, let's convert the numeric column to a categorical one. Let's write a method to map each numeric label in the dataframe to its category name.
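A small sketch of that mapping (the function and column names here are illustrative choices, not the notebook's originals):

```python
# Map each numeric label to its category name
label_names = {0: 'Negative', 1: 'Somewhat Negative', 2: 'Neutral',
               3: 'Somewhat Positive', 4: 'Positive'}

def to_category(label):
    return label_names[label]

# New categorical column, used only to make plots readable
data['Sentiment_Category'] = data['Sentiment'].apply(to_category)
```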
Now we will analyze the distribution of sentiments in the dataset.
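One way to look at the class balance, assuming seaborn is available and using the categorical column created above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of phrases per sentiment class
sns.countplot(x='Sentiment_Category', data=data,
              order=['Negative', 'Somewhat Negative', 'Neutral',
                     'Somewhat Positive', 'Positive'])
plt.xticks(rotation=30)
plt.show()
```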
Looks like we have a lot of samples belonging to the neutral class :-) People tend to be diplomatic in film reviews.
In this section we will focus only on the positive sentiments in the dataset. We will break the phrases down into words, clean the data, and visualize it.
Let's extract the positive phrases using the Sentiment column.
Now we will split each phrase into word tokens.
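A sketch of both steps, filtering the phrases labelled Positive and tokenizing them with NLTK (which class counts as "positive" follows the label mapping above):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

# Keep only the positive phrases
positive_phrases = data.loc[data['Sentiment_Category'] == 'Positive', 'Phrase']

# Flatten every phrase into a single list of lower-cased tokens
positive_tokens = []
for phrase in positive_phrases:
    positive_tokens.extend(word_tokenize(phrase.lower()))
```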
After this we will count the words that occur in positive phrases; this will help us understand what positive phrases mainly consist of and what we actually have in them.
Framing the results into a dataframe for visualization.
Let's do a barplot.
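Counting the tokens and plotting the most frequent ones; the top-20 cutoff here is an arbitrary choice:

```python
from collections import Counter

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Frequency of each token among the positive phrases
word_counts = Counter(positive_tokens)
freq_df = pd.DataFrame(word_counts.most_common(20), columns=['token', 'count'])

sns.barplot(x='token', y='count', data=freq_df)
plt.xticks(rotation=45)
plt.show()
```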
It seems the most common token is ",", followed by "the" and other articles and similar parts of speech. That seems pretty obvious, because these make up most of what we say in daily life; "the" and "and" are among the most common words in English sentences.
But for building a predictive model for sentiment analysis, these won't play a special role, since they are common to all sentences.
Now it's time to gear up and clean our reviews. We have seen in the visualization above that these tokens are prominent in our reviews but don't play a very important role. We will clean the reviews using NLTK, the Python natural language processing library.
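Roughly, the imports and downloads used in what follows (a sketch; the exact set depends on what you end up using):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads for the stop word list and the lemmatizer's dictionary
nltk.download('stopwords')
nltk.download('wordnet')
```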
We have imported the stop word list, a stemmer, and a lemmatizer, and downloaded WordNet for the lemmatizer. Let's see what each of these is and how it will help clean our reviews.
Stop words are words that are very frequent and, on their own, carry little meaning: articles, pronouns, and filler words such as "umm" or "uhh". Unless we are specifically looking for such patterns (say, someone who is hesitant or hasn't practiced speaking much), they add noise rather than signal, so we remove them.
Stemming is the process of reducing a word to its stem, i.e. its root form: the root of "eating" is "eat", and of "sleeping" is "sleep". Converting words to their roots reduces the number of distinct word forms the model has to deal with.
Lemmatization is similar to stemming, but it maps different inflected forms of a word to a common dictionary root. For example, "go", "going", and "went" are all mapped to "go".
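A quick look at both in action (note that WordNetLemmatizer treats words as nouns by default, so a part-of-speech hint is passed here):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops words down to a crude root
print(stemmer.stem('eating'), stemmer.stem('sleeping'))    # eat sleep

# Lemmatization maps inflected forms to a dictionary root
print(lemmatizer.lemmatize('going', pos='v'),
      lemmatizer.lemmatize('went', pos='v'))               # go go
```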
Now we will write a function that does this for us. Since both stemming and lemmatization reduce words to their root forms, either one can be used; I will use lemmatization in this notebook. We will also use Python's built-in regular expression library, re, to remove symbols, special characters, and numbers.
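A sketch of such a cleaning function; the name clean_reviews and the lemma flag mirror how the text refers to it, but the implementation details here are assumptions:

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_reviews(review, lemma=1):
    # Keep letters only: numbers, symbols and special characters become spaces
    review = re.sub('[^a-zA-Z]', ' ', review)
    words = review.lower().split()
    # Drop stop words, then reduce each remaining word to its root form
    if lemma:
        words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    else:
        words = [stemmer.stem(w) for w in words if w not in stop_words]
    return ' '.join(words)
```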
The above function processes each of the raw positive reviews in the data and returns the cleaned review. We have added a lemma flag (set to 1) so that we can easily switch between lemmatization and stemming with the same function.
Now let's get things done!
```
0     series escapade demonstrating adage good goose...
1     series escapade demonstrating adage good goose
2     series
3
4     series
5     escapade demonstrating adage good goose
6
7     escapade demonstrating adage good goose
8     escapade
9     demonstrating adage good goose
10    demonstrating adage
11    demonstrating
12    adage
13
14    adage
15    good goose
16
17    good goose
18
19    good goose
20
21    good goose
22    good
23    goose
24
25    goose
26    goose
27    also good gander occasionally amuses none amou...
28    also good gander occasionally amuses none amou...
...   (and so on, up to the number of positive reviews)
```
Now that we have seen how to deal with the raw reviews, let's apply the same treatment to all the phrases.
Now let's clean all the reviews by applying clean_reviews to every phrase and creating a new column in the dataframe, 'cleaned_reviews'.
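Applying it across the dataframe is one line (a sketch; apply on 150,000+ rows is slow but straightforward):

```python
# Clean every phrase and keep the result alongside the raw text
data['cleaned_reviews'] = data['Phrase'].apply(clean_reviews)
```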
It takes some 3-5 minutes to process, since there are about 156,000 reviews. Now we will use the cleaned reviews as features and start building our predictive model using various NLP techniques: Bag of Words, TF-IDF (term frequency - inverse document frequency), and Word2Vec. We will use Bag of Words and TF-IDF for feature extraction. Let's look at both.
The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It is simple to understand and implement, and it has seen great success in problems such as language modeling and document classification.
TF-IDF (term frequency - inverse document frequency) is another technique for dealing with text data. In the Bag of Words model we simply count the occurrences of words and write the count at a particular index in the feature vector. But raw counts let very frequent, uninformative words dominate, while rarer and often more discriminative words get little weight. TF-IDF weighs how often a word appears in a given document (the term frequency) by how rare that word is across all documents (the inverse document frequency, based on the number of documents containing the word relative to the total number of documents in the dataset).
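In its most common form, the weight of a term t in a document d is (sklearn's TF-IDF adds smoothing and normalization on top of this, so its exact values differ slightly):

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is how many times t occurs in d, df(t) is the number of documents containing t, and N is the total number of documents.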
Now we will use sklearn to compute the feature vectors for both Bag of Words and TF-IDF.
TfidfVectorizer (or the lower-level TfidfTransformer) and CountVectorizer are sklearn's implementations of TF-IDF and Bag of Words, respectively.
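A sketch of the Bag of Words step, assuming the 5,000-term vocabulary implied by the output below:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag of Words: keep the 5,000 most frequent terms as the vocabulary
count_vect = CountVectorizer(max_features=5000)
bow_features = count_vect.fit_transform(data['cleaned_reviews'])
bow_features
```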
<156060x5000 sparse matrix of type '' with 524767 stored elements in Compressed Sparse Row format>
We have got a 156060 x 5000 sparse matrix, where each row corresponds to a single document and holds a feature vector of 5,000 features. CountVectorizer lets us choose an arbitrary size for the feature vector (via its max_features parameter).
Now we will fit a TF-IDF vectorizer on the bag-of-words features.
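The 2,000-feature shape below suggests the TF-IDF step was fit with its own, smaller vocabulary rather than directly on the 5,000-column count matrix; a sketch assuming sklearn's TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighted features over the cleaned reviews, capped at 2,000 terms
tfidf_vect = TfidfVectorizer(max_features=2000)
tfidf_features = tfidf_vect.fit_transform(data['cleaned_reviews'])
tfidf_features
```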
<156060x2000 sparse matrix of type '' with 419006 stored elements in Compressed Sparse Row format>
Now we will train a Naive Bayes classifier for sentiment analysis. Naive Bayes is a popular and widely used choice for text classification.
We will split the whole TF-IDF sparse matrix into train and test sets using sklearn's train_test_split.
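A sketch of the split; the 80/20 ratio and the random seed are assumptions, not the notebook's original settings:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    tfidf_features, data['Sentiment'], test_size=0.2, random_state=42)
```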
Let's train our model with Multinomial Naive Bayes, which is suited to multiclass classification problems.
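Fitting the model is a one-liner with sklearn's MultinomialNB (default hyperparameters assumed):

```python
from sklearn.naive_bayes import MultinomialNB

# Multinomial NB works naturally with count / TF-IDF style features
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
```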
Now let's predict the labels of the test set and check the accuracy.
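Something along these lines (variable names continue from the sketches above):

```python
from sklearn.metrics import accuracy_score

nb_predictions = nb_model.predict(X_test)
print(accuracy_score(y_test, nb_predictions))
```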
0.562
It seems we have achieved an accuracy well above random guessing, but Naive Bayes is not performing all that well on sentiment analysis.
Since, as we have seen, Naive Bayes is not giving great predictions, let's use a random forest, an ensemble of decision trees.
Let's check the predictions and accuracy of our random forest model.
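A sketch of the random forest run; n_estimators and the seed are assumptions, not the author's settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)
print('accuracy :', accuracy_score(y_test, rf_predictions))
```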
accuracy : 0.6189708737864078
Whoo! Roughly a 6-percentage-point jump in accuracy with the random forest (0.562 to 0.619); we have improved our model's performance significantly.