Analysis of Supervised Learning for Subjective Classification of ‘Quit smoking or not’ Tweets

Mohanakrishnakumar Karunakaran
Department of Computer Science, University of Illinois at Chicago
[email protected]

Srinivasan Venkatesan
Department of Computer Science, University of Illinois at Chicago
[email protected]
Abstract - Social media platforms such as Twitter and Facebook are rapidly becoming key resources for public health surveillance, and can aid in the introduction and review of new products in the market. Many tweets discuss various health problems, their preventive measures, and much more. Here, we use one such dataset from Twitter containing tweets about quitting smoking (or not). Our aim is to subjectively classify tweets based on their content and sentiment using different NLP classifier models and to analyze their performance. We create different feature sets and analyze the classification performance over two classes, “Cessation” and “No Cessation”, using Naïve Bayes and SVM classifiers. We also report the effect of each feature set and suggest ideas to improve the classification accuracy.
I. INTRODUCTION
Twitter is a popular online “micro-blogging” service that enables users to exchange very short messages of up to 140 characters. Its rapid growth makes it ideal for diverse research, as its humongous data, in its entirety, reflects people’s reactions, views, emotions, and sentiments on a variety of incidents: a new electronic gadget release, a natural calamity, terrorist incidents, and elections, to name a few.

In this paper, we focus on the problem of subjective classification of a tweet dataset provided by the Health Science Research Team at The University of Illinois at Chicago (UIC), posted in May/June 2012 and containing keywords about electronic cigarettes. The statistics of the corpus are as follows:

Tweet Type             Count
Total Training         4310
Total Testing          2139
Total Cessation        2371
Total Non-Cessation    4078

Two-thirds of the data is used as training data for the classifiers and the remaining one-third is used as test data. Each tweet also carries metadata such as tweet id, follower count, friend count, and posting time, which might be useful.

A. Problem Summary
The point of interest in this research is the mention of quitting smoking (class: Cessation) in the tweet text, the username, or the URL. Our aim is to analyze the performance of different supervised learning models over two classes: “Cessation”, for tweets that express a view to quit smoking, and “No Cessation”, for tweets that do not. We show the performance of Naïve Bayes and Support Vector Machines (SVM).

B. Approach
First, we preprocess the tweets using steps including, but not limited to, stop-word removal, spell checking, and lemmatization. Next, we evaluate different feature sets: unigram selection, top-scoring bigram selection, emoticon conversion, and the highest-frequency synsets, hypernyms, and hyponyms for each feature word’s part of speech using WordNet [1]; we also use a spell checker based on Peter Norvig’s algorithm [2] to correct words to proper English in correlation with WordNet. Next, we obtain a feature vector to build a sparse matrix. We used the Naïve Bayes (binomial) classifier provided by NLTK [3] and the SVM provided by Scikit-Learn [4] to analyze our features. The sparse matrix from the training data, along with the corresponding class for every tweet, is used to train the classifiers. Once the classifiers are trained, the test data is classified after the same data preprocessing and feature extraction steps. The following sections describe each of these steps in detail.

C. Results
We performed 5-fold cross-validation to test the accuracy of the classifiers. The Naïve Bayes classifier produced an accuracy of 87%, while SVM produced an accuracy of 90%. Random Forest initially yielded 75%, and we discontinued it.
II. ALGORITHM DESCRIPTION

A. Data Preprocessing
Analyzing data that has not been carefully screened can produce misleading results. Tokenization is even more important in sentiment analysis than it is in other areas of NLP, because sentiment information is often sparsely and unusually represented; a single cluster of punctuation like :-( might tell the whole story. Thus, the representation and quality of the data come first and foremost before running an analysis. The given dataset is subjected to a wide range of preprocessing before being used for classification. The words in the tweets were tokenized, and the following preprocessing steps were performed before extracting the feature set from every tweet:

Case conversion: All tweets were converted to lower case, as this makes string comparison and other data manipulation easier. Keeping the dataset in a single case does not lose any special substance.

Hyperlink removal: The hyperlinks in the tweets were removed, as they do not serve any purpose in expressing a view or opinion.

Repeated characters: Two or more occurrences of a character in a word were replaced with exactly two characters, which lowers the chance of missing the original word. E.g., “coffeeeeeeeeeeeeeee” becomes “coffee”.

Special character removal: Special characters were removed from the data, but emoticons were preserved, as they help in understanding the tweet better. Sequences of two or more periods are likely to be ellipsis dots and can be collapsed to “...”.

Stop words: English stop words were removed using the nltk.corpus library provided by NLTK.

Spell checker: We included a spell checker from NodeBox [2], which implements Peter Norvig’s algorithm to correct misspelled words. Each word is spell checked, and any correction with confidence > 0.8 replaces the original word in the feature list.
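As a concrete illustration, here is a minimal sketch of these preprocessing steps in Python, assuming NLTK’s stop-word corpus is available; the regular expressions are illustrative approximations, not the exact implementation used in this work.

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def preprocess(tweet):
    # Case conversion: lower-case the whole tweet.
    text = tweet.lower()
    # Hyperlink removal: drop http/https URLs.
    text = re.sub(r'https?://\S+', '', text)
    # Repeated characters: squeeze runs of 3+ of the same character
    # down to exactly two ("coffeeeeee" -> "coffee").
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # Collapse runs of periods into ellipsis dots.
    text = re.sub(r'\.{2,}', '...', text)
    # Tokenize on whitespace; drop stop words and numeral-initial tokens.
    return [t for t in text.split()
            if t not in STOPWORDS and not t[0].isdigit()]

print(preprocess("I am quitttt smoking today.... http://t.co/abc"))
# -> ['quitt', 'smoking', 'today...']
```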
Twitter mark-up:
• The #hashtag words were split into meaningful words, e.g., #wickedwreaked to “Wicked”, “Wreaked”. Only splits with probability greater than 0.50 are considered.
• The usernames in the tweets were removed, as they do not convey emotion.
• Single non-meaningful English characters were removed from the tweets.
• Words starting with a numeral were considered non-meaningful and removed.

Stemming: Stemming is a method for collapsing distinct word forms. This can reduce the vocabulary size and thereby sharpen one’s results, especially for small datasets. We used the Porter stemmer, one of the earliest and best-known stemming algorithms. It works by heuristically identifying word suffixes (endings) and stripping them off, with some regularization of the endings. However, the stemmer often collapses sentiment distinctions by mapping two words with different sentiment onto the same stemmed form.

Lemmatization: After considering the grammatical forms (ADJ, VERB, NOUN, ADV) of a word, its different inflected forms were grouped as a single item using lemmatization.
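The POS-aware grouping can be sketched with NLTK’s WordNetLemmatizer; the Treebank-to-WordNet tag mapping below is an assumed, reasonable implementation, not the exact code used here.

```python
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# requires nltk.download('wordnet') and nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags onto the four WordNet POS classes.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # sensible default

def lemmatize(tokens):
    return [lemmatizer.lemmatize(word, to_wordnet_pos(tag))
            for word, tag in pos_tag(tokens)]

print(lemmatize(['quitting', 'cigarettes', 'was', 'easier']))
# e.g. -> ['quit', 'cigarette', 'be', 'easy'] (exact output depends on the tagger)
```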
B. Features
This step analyzes the linguistic features of the processed tweets so that important information can be identified. The following features are extracted, and a vector is generated for every tweet. We also used tf-idf to generate a weighted matrix of the features.

Unigrams: The preprocessed tweet is tokenized and every single word is added to the feature vector. The feature vector is maintained as a list in Python.

Bigrams: The top 100 bigrams are extracted from the preprocessed tweets and appended to the feature vector. We chose the top 100 bigrams because the accuracy saturated at that level. BigramCollocationFinder, provided by nltk.collocations, was used for the extraction.

Emoticons: Emoticons are extremely common in many forms of social media, and they are reliable carriers of sentiment.

Synonyms: Using the WordNet dictionary, the highest-frequency synset is retrieved for the tweet words belonging to the Noun, Verb, Adverb, and Adjective parts of speech.

Hypernyms and Hyponyms: Using the WordNet dictionary, the highest-frequency hypernyms and hyponyms are retrieved for the tweet words belonging to the Noun, Verb, Adverb, and Adjective parts of speech.
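A sketch of how the bigram and WordNet lookups described above might be implemented follows; the chi-squared scoring measure and the “first synset is the most frequent” convention are assumptions for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.corpus import wordnet

def top_bigrams(all_tokens, n=100):
    # Score bigrams over the whole corpus and keep the top n by chi-squared.
    finder = BigramCollocationFinder.from_words(all_tokens)
    return finder.nbest(BigramAssocMeasures.chi_sq, n)

def wordnet_features(word):
    # WordNet orders synsets by frequency, so take the first one
    # and pull one synonym, hypernym, and hyponym name from it.
    synsets = wordnet.synsets(word)
    if not synsets:
        return []
    top = synsets[0]
    feats = [lemma.name() for lemma in top.lemmas()[:1]]
    feats += [h.name().split('.')[0] for h in top.hypernyms()[:1]]
    feats += [h.name().split('.')[0] for h in top.hyponyms()[:1]]
    return feats
```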
C. Learning algorithms
We tried the following NLP classification algorithms:
• Naïve Bayes (binomial)
• SVM
• Random Forest
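For concreteness, here is a minimal training sketch with Scikit-Learn. The paper does not specify the SVM variant, so LinearSVC and the tf-idf settings below are assumptions, and train_texts/train_labels are placeholder names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_svm(train_texts, train_labels):
    # Tf-idf produces the weighted sparse matrix described above.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(train_texts)
    clf = LinearSVC()  # assumed SVM variant
    clf.fit(X, train_labels)
    return vectorizer, clf

def predict(vectorizer, clf, texts):
    # Apply the same feature extraction before classifying.
    return clf.predict(vectorizer.transform(texts))
```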
III. EXPERIMENTAL RESULTS

A. Data Description
We were given a total of 1834 test tweets with the attributes described earlier. This data differed from the training data in that the URLs were not in shortened form and there were no duplicate tweets. The results of the test run are given below.
B. Test Results

1. Classifier: SVM (Accuracy: 84.62%)

Class          Precision   Recall   F-Score
Cessation      0.86        0.90     0.88
No Cessation   0.77        0.69     0.73

2. Classifier: Naïve Bayes (Accuracy: 81.92%)

Class          Precision   Recall   F-Score
Cessation      0.79        0.95     0.86
No Cessation   0.82        0.48     0.61
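The per-class figures above can be computed with Scikit-Learn’s built-in metrics; a minimal sketch, where y_true and y_pred are placeholder names for the gold and predicted labels.

```python
from sklearn.metrics import accuracy_score, classification_report

def report(y_true, y_pred):
    print("Accuracy: %.2f%%" % (100 * accuracy_score(y_true, y_pred)))
    # Per-class precision, recall, and F-score, as in the tables above;
    # target_names assumes the labels sort as Cessation, No Cessation.
    print(classification_report(y_true, y_pred,
                                target_names=["Cessation", "No Cessation"]))
```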
C. Features and their impact
• In the data-preprocessing step, stemming and lemmatization decreased the average accuracy by 1-2%, so we did not include them in the final classifier.
• The use of bigrams did not increase our accuracy during training; however, on the test data, bigrams increased the accuracy by 0.5%.
• The accuracy increased by about 3% with the inclusion of synonyms, hypernyms, hyponyms, and the spell checker.
• The use of emoticons did not have a significant impact on the accuracy of the classifier.
• 5-fold cross-validation with shuffled data increased our accuracy by 4% compared to plain validation sets (see the sketch below).
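A sketch of the shuffled 5-fold cross-validation referred to in the last bullet, using Scikit-Learn; X and y stand for the feature matrix and labels from the training sketch.

```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import LinearSVC

def shuffled_cv_accuracy(X, y):
    # shuffle=True is what distinguishes this from a plain split.
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearSVC(), X, y, cv=folds)
    return scores.mean()
```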
IV. CONCLUSION AND FUTURE WORK
We analyzed the performance of various classifiers at exploiting the linguistic features extracted with the mining methodologies above. We feel there is good scope for improvement. Some possible ideas to improve this classification problem are given below:

Additional punctuation: We removed the extra punctuation; however, keeping punctuation at the tokenization stage can be used to identify further structure in the tokenized string. The basic strategy for handling punctuation could be to identify all the word-internal marks first, so that any others can be tokenized as separate elements. For example, sequences consisting entirely of digits, commas, and periods are likely to be numbers and so can be tokenized as words. Optional leading monetary signs and closing percentage signs are good to allow as well. The remaining punctuation can be kept as separate words; by and large, this means question marks, exclamation points, and dollar signs without following digits. It works well to tokenize sequences like “!!!” into three separate exclamation marks, and similarly for “!?!?” and the like, since the progression from “!” to “!!” is somewhat additive.

Capitalization: Preserving capitalization across all words can result in unnecessary sparseness. Words written in all caps are generally worth preserving, though, as they tend to be acronyms or words people intended to emphasize, which correlates with sentiment information.

Negation marking: The rules of thumb for how negation interacts with sentiment words are roughly as follows:
1. Weak (mild) words such as good and bad behave like their opposites when negated: bad ≈ not good; good ≈ not bad.
2. Strong (intense) words like superb and terrible have very general meanings under negation: not superb is consistent with everything from horrible to just-shy-of-superb, and different lexical items favor different senses.
These observations suggest that it would be difficult to have a general a priori rule for how to handle negation: it doesn’t just turn good to bad and bad to good; its effects depend on the words being negated. An additional challenge is that the expression of negation is lexically diverse and its influence is far-reaching (syntactically speaking), as in “I didn’t enjoy it.” As a remedy, append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark, as sketched below.
Tweet: “I don't think I will enjoy it: it might be too spicy.”
Converted: “I don't think_NEG i_NEG will_NEG enjoy_NEG it_NEG: it might be too spicy.”
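A minimal sketch of this _NEG marking; the negation-cue list and punctuation set are illustrative assumptions.

```python
import re

NEGATIONS = re.compile(r"(?:never|no|not|n't)$", re.IGNORECASE)  # rough cue list
CLAUSE_PUNCT = re.compile(r"[.:;!?,]")

def mark_negation(tokens):
    # Append _NEG to every token between a negation word and the
    # next clause-level punctuation mark.
    out, in_scope = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.search(tok):
            in_scope = False
            out.append(tok)
        elif in_scope:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if NEGATIONS.search(tok):
                in_scope = True
    return out

print(mark_negation("i don't think i will enjoy it : it might be too spicy".split()))
# -> ['i', "don't", 'think_NEG', 'i_NEG', 'will_NEG', 'enjoy_NEG',
#     'it_NEG', ':', 'it', 'might', 'be', 'too', 'spicy']
```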
Parts-of-speech tagging: It might be worth the effort to run a part-of-speech tagger on the sentiment data and then use the resulting word-tag pairs as features or components of features, as in the sketch below.
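A minimal sketch of such word-tag features using NLTK’s default tagger:

```python
import nltk  # requires nltk.download('averaged_perceptron_tagger')

def pos_tag_features(tokens):
    # Join each word with its tag, so that e.g. "smoke/NN"
    # and "smoke/VB" become distinct features.
    return ["%s/%s" % (word, tag) for word, tag in nltk.pos_tag(tokens)]

print(pos_tag_features(["i", "will", "quit", "smoking"]))
```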
Dependency parsing: Dependency parsing transforms a sentence into a quasi-semantic structure that can be extremely useful for extracting sentiment information, particularly where the goal is to relativize the sentiment information to particular entities or topics. The Stanford Parser can map raw strings, tokenized strings, or POS-tagged strings to dependency structures (among others). Such dependencies can isolate not only what the sentiment of a text is but also where that sentiment is coming from and whom it is directed at.

Sentiment lexicon via ER values: Many sentiment applications rely on lexicons to supply features to a model. A word w is positive if ER(w) ≥ 0, else negative; its intensity is abs(ER(w)).

Lexicon induction as a feature set: Restricting attention to the features that have an interesting relationship to the metadata in tweets can be extremely effective in practice. The drawback is that it might limit the classifier’s ability to generalize to out-of-domain data, where the associations might be different. It is possible to address this by using unsupervised methods to expand the initial lexicon to include more general associations.

Scope marking for attitude reports: Scope marking is also effective for marking quotation and the effects of attitude reports like say, claim, etc., which are often used to create distance between the speaker’s commitments and those of others, e.g., “They said it would be horrible, but they were wrong: I loved it!!!” For quotation, the strategy is to turn marking on and off at quotation marks; to account for nesting, one can keep a counter (though nesting is rare). For attitude verbs, the strategy is the same as for negation: _REPORT marking between the relevant predicates and clause-level punctuation.
V. REFERENCES
[1] WordNet: wordnet.princeton.edu
[2] Casey Whitelaw, Ben Hutchinson, Grace Y. Chung, Gerard Ellis. Using the Web for Language Independent Spellchecking and Autocorrection.
[3] NLTK, a platform for building Python programs to work with human language data: nltk.org
[4] Scikit-learn, Machine Learning in Python: scikit-learn.org/stable/index.html
[5] WordSplitter, Peter Norvig’s algorithm: http://norvig.com/ngrams/ngrams.py
[6] Bo Pang and Lillian Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.
[7] Christopher Potts. Sentiment Symposium Tutorial. Sentiment Analysis Symposium, San Francisco, November 8, 2011.