Sentiment Analysis on Twitter Data using Word2Vec

12.06.2017

Arvind Ram Singh Kishore (G01035985)
Viswanath Subramanian Ramesh (G01061924)

Abstract

Mining text data to draw inferences is a well-explored problem. Over the years, several methods have been introduced to cluster, classify and analyze text data. However, many of these approaches rely on word-frequency techniques and treat the problem in a linear space without any domain knowledge. They also commonly assume that the data provided is exhaustive. In this paper, we propose to dynamically embed each word into a high-dimensional vector space using Word2Vec, a shallow two-layer neural network model developed at Google. We then classify these vectors as positive or negative using several classification algorithms, compare the results, and determine the most suitable classification technique for the Word2Vec model.

I. Introduction

With almost everyone having access to technology, voicing opinions on the internet has become very common. New technologies appear almost every day, and one that stole the spotlight is Twitter. According to statistics available online, Twitter has roughly 1.3 billion users, of which 310 million are monthly active users who regularly log in to share their thoughts at a rate of almost 500 million tweets per day. With this much data created every day, text mining can be used to understand whether people hold a positive or a negative opinion about a given idea. These tweets can help uncover valuable insights into users’ thoughts.
However, mining Twitter data poses several challenges. Twitter imposes a limit of 140 characters per tweet, and much of that space is taken up by elements such as username tags, hashtags and URLs, which carry no useful information for this task. The space limitation not only leaves tweets incomplete but also pushes people to abbreviate words and sentences, producing grammatically incorrect text. Moreover, classification algorithms fail to detect sarcasm, a very common human trait, so negative sarcastic tweets can be misclassified as positive. More recent techniques such as Word2Vec convert each word into a vector in a high-dimensional (~200-300) space and help capture the context of the word, placing it near similar words in that space.

Our study aims to analyze the sentiment of tweets and classify it into two categories, positive and negative. Sentiment analysis is a Natural Language Processing task that analyzes text and its syntactic context in order to identify subjective information, here in Twitter posts. Standard algorithms for text classification include Gaussian Naive Bayes, Support Vector Classifiers and Logistic Regression. These classifiers have proven successful in various text classification problems, and they suit our task because we only require a binary output: positive or negative, 1 or 0. However, each word is represented as a vector in a high-dimensional space, which makes it hard to know in advance which classifier best suits this representation. We therefore propose to feed the embedded words to several classifiers and to compare their performance under various parameter settings. Our final aim is to identify the best classifier to pair with Word2Vec.
This project can be extended to various applications such as recommender systems and review classification. Any trending product, movie or place can be analyzed to see whether it has a positive or a negative effect on people.

II. Methodology

We obtained our dataset from an open-source GitHub repository, which provided more than sufficient data (~1.6 million tweets). The tweets in the dataset had been manually classified as positive or negative. Neutral sentiment was not considered, to keep the complexity of the model low. Although the dataset had multiple attributes, such as class, UserID and date of the tweet along with the tweet text, only the class attribute and the tweet text were used to build the model. The reason for this choice is that semantic analysis of words and sentences relies on the text itself, where each word is treated as an individual dimension, while attributes such as date and ID do not affect the learning model. Positive tweets had a class label of ‘1’ and negative tweets a class label of ‘0’. Due to the large number of negative tweets and insufficient computational power, the dataset was reduced from 1.6 million to about 1 million tweets.

Before training a machine learning model on the dataset, we conducted some exploratory analysis to better understand the data. The graph below shows the breakdown of the dataset. We can observe from the graph that the dataset contains about 800,000 negative tweets and about 250,000 positive tweets, so positive tweets number less than a third of the negative tweets. In such cases, the model we build, regardless of its complexity, will be biased towards classifying tweets as negative. Techniques such as oversampling and undersampling can be used to overcome this bias.
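As an illustration of random oversampling, one of the techniques named above, the following is a minimal standard-library sketch rather than the code used in the experiments. The function name random_oversample and the toy tweets are hypothetical. The function duplicates randomly chosen minority-class items, with replacement, until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class items (with replacement)
    until every class has as many items as the largest class."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing items at random from the existing ones.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy training split skewed towards the negative class (0 = negative, 1 = positive).
train_x = ["bad day", "so sad", "awful", "hate this", "love it"]
train_y = [0, 0, 0, 0, 1]

balanced_x, balanced_y = random_oversample(train_x, train_y)
class_counts = Counter(balanced_y)
```

After the call, both classes in the toy split have the same number of items, and every added positive item is a copy of an existing one.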
Oversampling is the technique of picking random items from the minority class with replacement and replicating them. Two methods can be used to perform oversampling: random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE). We used the RandomOverSampler class to oversample the data. We must be careful to oversample only the training data: oversampling the test data may yield misleadingly high accuracy, because repeated items in the test data are classified correctly more than once. This phenomenon is known as “bleeding” of data into the test data, and it must be avoided so that the test data stays novel. Oversampling does not completely solve the issue, because the model only re-learns data it has already learnt, which adds little new information, but it has proven useful in addressing sample imbalances.

Pre-Processing

The dataset was cleaned of all unwanted elements such as usernames, hashtags and links. After removing these terms, we tokenized every sentence using NLTK’s Tweet tokenizer to convert it into a list of strings. We then used the LabeledSentence class from gensim’s Doc2Vec module to convert all the sentences into sentence objects made up of individual tokens. Tokens are instances of a sequence of characters in a document that are grouped together as a semantic unit for processing [1]. LabeledSentence creates an instance of TaggedDocument for this purpose. It aggregates all the words in a sentence into an object with a unique tag attached, so each sentence becomes a pair consisting of a list of strings and a unique tag.

Due to Twitter’s 140-character limit, we did not remove stop words (prepositions, articles, etc.), under the premise that they contribute to the semantic structure of the sentence.
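The cleaning and tokenizing steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions: the regular expressions, the function names clean_tweet, tokenize and tag_tweets, and the TWEET_i tag format are all hypothetical, and the regex tokenizer is only a simple stand-in for NLTK's Tweet tokenizer and for gensim's tagged sentence objects.

```python
import re

# Hypothetical patterns for the three kinds of elements that are removed.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text):
    """Remove links, @username mentions and #hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def tokenize(text):
    """Lowercase the text and split it into word tokens.
    A simple stand-in for NLTK's Tweet tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tag_tweets(tweets):
    """Pair each token list with a unique tag, giving (words, tag) pairs."""
    return [(tokenize(clean_tweet(t)), f"TWEET_{i}") for i, t in enumerate(tweets)]


tagged = tag_tweets([
    "@bob I love the new phone! #happy http://t.co/abc",
    "This is the worst day of my life",
])
```

Stop words such as 'the', 'is' and 'of' are deliberately left in the token lists.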
It is vital for the Word2Vec model to capture the context of these words.

Word embeddings were computed for all the words in the tweets using the Word2Vec implementation in gensim, which produced a 300-dimensional vector for each word. At this point each word had a vector representation, but a classifier needs one vector per tweet, and tweets are sentences. We therefore had to decide how to represent an entire tweet as a vector, and we had two options. One obvious and widely used option is to add the vectors of every word in the tweet and divide the sum by the length of the tweet. However, a tweet may contain multiple articles and prepositions that contribute little to its context. We therefore calculated the TF-IDF value of every word. The TF-IDF score gives the importance of each word in a given document of a corpus. TF, the term frequency, measures how often a specific term occurs in a document. Because a term is likely to occur more often in a longer document than in a shorter one, the TF is divided by the number of words in the document, also referred to as the length of the document [2]. TF on its own treats all words in the document as equally important. The IDF, or Inverse Document Frequency, is therefore calculated as the log of the ratio of the total number of documents to the number of documents containing the term [2]. This gives a higher value to terms that occur in few documents and a lower value to terms that occur in many. The product of the two gives the TF-IDF score [2].
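The TF and IDF definitions above can be written out directly. The sketch below is a minimal standard-library illustration, not the code used in the experiments. The function names, the toy documents and the 2-dimensional word vectors are hypothetical. It also shows how such scores could serve as per-word weights when word vectors are averaged over a tweet.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    TF  = occurrences of the term in the document / document length
    IDF = log(total documents / documents containing the term)
    """
    n_docs = len(documents)
    # Count in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def weighted_tweet_vector(tokens, weights, word_vectors):
    """Sum each word vector scaled by its TF-IDF weight,
    then divide by the number of tokens in the tweet."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    if not tokens:
        return total
    for token in tokens:
        for i, value in enumerate(word_vectors[token]):
            total[i] += weights[token] * value
    return [x / len(tokens) for x in total]


docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "food", "was", "great"],
]
scores = tf_idf(docs)

# Toy 2-dimensional vectors standing in for 300-dimensional embeddings.
word_vectors = {
    "the": [1.0, 1.0],
    "movie": [0.0, 2.0],
    "was": [1.0, 0.0],
    "awful": [-2.0, 0.0],
}
tweet_vec = weighted_tweet_vector(docs[1], scores[1], word_vectors)
```

In this toy corpus 'the' and 'was' occur in every document, so their IDF is log(1) = 0 and they contribute nothing to the tweet vector, while the rarer 'awful' carries the largest weight.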
This led us to the more effective method of vectorizing tweets: the TF-IDF score of each word is used as a weight that multiplies the word's vector, the weighted vectors of all the words in the tweet are added together, and the sum is divided by the length of the tweet. This method was particularly useful in our experiment because we did not remove stop words such as ‘as’, ‘is’ and ‘the’, owing to the character limit of tweets.

Once the vectorization was done, the imbalance in the dataset had to be resolved. Random positive samples were oversampled until their count equalled the negative sample count. All the averaged vectors were scaled using the scale function from scikit-learn’s preprocessing module.

Vector Data Visualization

We need to visualize the data for preliminary analysis and assessment, but data with 300 dimensions cannot be visualized directly. We therefore performed dimensionality reduction using t-distributed stochastic neighbor embedding (t-SNE). t-SNE converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data [3]. The parameters chosen were n_components = 2, verbose = 1 and random_state = 0. With these parameters, we converted the 300-dimensional word vectors to 2-D vectors that can be plotted in Euclidean space using the matplotlib.pylab module.

Word2Vec & Doc2Vec

Word2Vec is a two-layer neural network algorithm that creates word embeddings. It converts English words to high-dimensional vectors, typically of 100 to 1000 dimensions. The model is trained to reconstruct and capture the context of a word through semantic analysis.
Each unique word in the corpus is assigned a corresponding vector in the space. These vectors are created in such a way that words sharing the same context are placed in close proximity to one another.

To produce the required distributed representation of words, Word2Vec can use either of two models: Continuous Bag of Words (CBOW) and Skip-Gram [4]. CBOW predicts the current word from the words in its surrounding context, whereas Skip-Gram predicts the surrounding context from the current word. In this analysis, we use both models and compare their results. Setting the parameter sg = 1 selects Skip-Gram and sg = 0 selects CBOW. By default, sg is 0, making CBOW the default model in Word2Vec. The training can be tuned further by choosing hierarchical softmax or negative sampling. Setting hs = 1 selects hierarchical softmax and hs = 0 selects negative sampling. Negative sampling is the default; when the negative sampling parameter is greater than 0, its value specifies the number of noise words to draw, usually between 5 and 10. Our analysis yielded better results with hierarchical softmax, so we use hs = 1 throughout the project. The image below shows the vector representation of the word ‘data’.

Classification

Several classifiers from the scikit-learn machine learning toolkit were used to classify the sentiment of the tweets. The model was trained using Gaussian Naive Bayes, a Support Vector Classifier and Logistic Regression. Naive Bayes and support vector classifiers were chosen because they are generally proven to be good classifiers and because the reference paper suggested them. Logistic regression is a good classifier when the final class prediction is binary, i.e., when there are only two classes.
In this experiment, there are only two output classes (1 for positive sentiment and 0 for negative sentiment).

Similarity Test

‘most_similar’ is one of the most important and interesting methods provided by the Word2Vec library. Its primary purpose is to find contextually similar words for a given term. For example, given the word ‘beer’, the most_similar method returns ‘wine’, ‘fruit’, ‘sauce’, ‘soda’ and ‘vodka’, among others, as similar terms. Along with each similar term, it provides a similarity score ranging from 0 to 1, where values closer to 1 indicate greater similarity. Applications such as recommender systems can be built on this method as an extension of product review classification: if a product has many negative reviews, a similar product with better reviews can be suggested instead. The only precondition is that the corpus be populated with all the similar products along with their reviews and descriptions, as these are needed to capture the context and domain of the products.

III. Experimental Result

Results

Some general observations were made while experimenting with the various parameters of the classifiers and the Word2Vec model. The Google documentation [5] on word2vec claimed that the hierarchical softmax training algorithm performs better than negative sampling. We tested this by varying the parameters of the Word2Vec model. Our observations confirmed the claim, and hierarchical softmax was therefore used throughout the rest of the experiment.

The Word2Vec model was tested with two different architectures, Continuous Bag of Words (CBOW) and Skip-Gram. Classification was done with the three classifiers mentioned above.

Classifier                   CBOW    Skip-Gram
Gaussian Naive Bayes         58%     62%
Support Vector Classifier    67%     74%
Logistic Regression          64%     72%

Accuracy Percentage by Classifier and Training Model

The accuracy percentages in the table are rounded to the nearest whole number and are averaged over multiple runs with the same parameters.
It is evident from the table that the Skip-Gram model outperformed the Continuous Bag of Words (CBOW) model with every classifier. Gaussian Naive Bayes achieved 58% accuracy with the CBOW model and an improved 62% with the Skip-Gram model. The Support Vector Classifier, used with the CBOW model, classified 67% of the tweets correctly, the best of the CBOW experiments. The best overall accuracy, 74%, was obtained with the same classifier combined with the Skip-Gram model. Logistic Regression consistently achieved an accuracy close to that of the SVM with both models: 64% with CBOW and 72% with Skip-Gram. Across all three classifiers, we conclude that the Support Vector Classifier paired with the Skip-Gram model classified with the highest accuracy.

CM      -ve       +ve
-ve     102425    19539
+ve     34987     52765

Confusion Matrix for SVM with Skip-Gram Model

The train-test split was fixed at 80% training data and 20% test data. Above is the confusion matrix for the classification of the test data of 209,716 tweets, roughly 20% of the total dataset, using the SVM classifier coupled with the Skip-Gram model in Word2Vec. The correct and incorrect counts for the negative class reveal the trend of the classification. Because of the higher number of negative samples, among the 74% of correct classifications, correctly classified negative tweets outnumber correctly classified positive tweets. Among the incorrect classifications, a positive tweet was more likely to be classified as negative than a negative tweet as positive. For the negative class, the classifier achieved a precision of 84% and an F1-score of 79%. For the positive class, it achieved a precision of 61% and an F1-score of 66%.
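The overall accuracy and the per-class F1-scores quoted above can be re-derived from the four counts in the confusion matrix. The sketch below is an illustration only: it assumes the matrix reads 102425 and 19539 in the -ve row and 34987 and 52765 in the +ve row, and the function names are hypothetical. F1 is computed from a class's row and column sums, so the result does not depend on which axis holds the true labels.

```python
# Confusion matrix for SVM with Skip-Gram; rows and columns are both
# ordered (negative, positive), as read from the matrix above (assumption).
confusion = [
    [102425, 19539],
    [34987, 52765],
]


def accuracy(cm):
    """Fraction of all items that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def f1_scores(cm):
    """Per-class F1 = 2*TP / (2*TP + FP + FN).

    Row sum + column sum of a class equals 2*TP + FP + FN, so the value
    is the same whichever axis holds the true labels."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        row_sum = sum(cm[k])
        col_sum = sum(cm[i][k] for i in range(n))
        scores.append(2 * tp / (row_sum + col_sum))
    return scores


total_tweets = sum(sum(row) for row in confusion)
acc = accuracy(confusion)
f1_negative, f1_positive = f1_scores(confusion)
```

The four counts add up to the 209,716 test tweets, and rounding the results gives 74% accuracy with F1-scores of 79% for the negative class and 66% for the positive class.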
Although the training data was oversampled to settle the imbalance, the repeated learning of familiar (positive) tweets did not contribute much to classifying novel positive tweets, as the numbers in the confusion matrix show.

Limitations

As mentioned earlier, we faced several limitations in this project due to the use of a Twitter dataset. One of the biggest issues was the length of the tweets, which was a major hindrance to improving the performance of the model. After the removal of user references, hashtags and links, the length of the tweets was drastically reduced. Moreover, people tend to abbreviate words to fit within the length limit. This made it difficult to capture the exact context of a sentence, which became a learning problem.

The model failed to detect sarcasm, as sarcasm is itself a fragile concept. We humans often fail to detect sarcasm in a conversation ourselves. The model classifying a negative sarcastic tweet as a positive one was therefore expected, but it remains an issue to be worked on.

Furthermore, given the size of the dataset of 1.6 million tweets, the training process was computationally intensive. Word2Vec is known to be relatively slow on huge datasets. In our case, the model took a few hours to generate results, as it had to form semantic context from 53 million possible combinations of unique words. Moreover, our personal computers were not built to handle such intensive computations, which added to the delay. As a result, trial-and-error parameter tuning of Word2Vec was very time consuming. In addition to these limitations, our dataset had an uneven distribution of positive and negative samples, which we addressed using oversampling.
Oversampling did not completely solve this issue, however: the model had to re-learn positive data it had already learnt, which took time and at the same time did not contribute much to improving the performance of the algorithm. The negative samples, in contrast, were robust and distinct, which led to the model being biased towards negative samples.