Twitter Sentiment Analysis With XGBoost

Blessing Magabane
May 19, 2020 · 6 min read



Introduction

We live in an era where fake news has become mainstream. Its wide spread is largely due to social media: more people than ever are getting their news from social platforms rather than traditional outlets. Notable among these are Facebook and Twitter, both of which have come under investigation in recent years for spreading fake news.

In this blog we are going to demonstrate how NLP (Natural Language Processing) techniques can be used to separate real tweets from fake tweets. Our source of data will be tweets extracted from Twitter, and we will use XGBoost as a classifier to complete the task of distinguishing real from fake tweets.

Datasets

We will be using the data from the Kaggle competition Real or Not? NLP with Disaster Tweets (https://www.kaggle.com/c/nlp-getting-started).

The data provided is divided into training and testing sets.

Each sample in the train and test set has the following information:

  • The text of a tweet
  • A keyword from that tweet (although this may be blank!)
  • The location the tweet was sent from (may also be blank)

Columns

  • id — a unique identifier for each tweet
  • text — the text of the tweet
  • location — the location the tweet was sent from (may be blank)
  • keyword — a particular keyword from the tweet (may be blank)
  • target — in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

The main goal is to predict whether a tweet describes a real disaster or not.

Model

In this section we clean the data and train the model using the training data. The training data has 7613 tweets of varying text length. Similarly, the testing data has 3263 tweets which we need to classify as real or fake disasters.

Text data is unstructured, which makes it difficult to work with, especially for machine learning purposes. Machine learning algorithms are optimized to work with numerical data, and it is for this reason that we need to move from text to numbers. We achieve that by normalizing the text and embedding it into a numerical vector. This process can be fairly complicated depending on the structure of the text you are working with. For example, a sentence that is full of asterisks and special characters can make the cleaning process tedious.

Standardizing the tweets

Tweets are full of asterisks and special characters, which makes the classification process tedious. In order for algorithms to classify the tweets we need to vectorize the text, and before we can do that we have to remove the hashtags and special characters. Characters that are common in tweets are the hashtag and the at sign.

Below is a screen print of the cleaning process; you can follow along with my code.
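Roughly, that cleaning step could look something like the following sketch. I am assuming the data is loaded into pandas DataFrames named train and test, and the clean_text helper name is only for illustration:

    import re
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    def clean_text(tweet):
        # Remove links, @ mentions, the hashtag symbol and any non-letter characters
        tweet = re.sub(r"http\S+|www\.\S+", " ", tweet)
        tweet = re.sub(r"@\w+", " ", tweet)
        tweet = tweet.replace("#", " ")
        tweet = re.sub(r"[^a-zA-Z\s]", " ", tweet)
        return tweet.lower()

    train["clean_text"] = train["text"].apply(clean_text)
    test["clean_text"] = test["text"].apply(clean_text)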

We also need to remove unwanted patterns in the tweets to preserve consistency in the text. One such pattern is unbalanced spacing between words in a sentence, or a tweet in this case. Once we have dealt with that we are a step closer to vectorizing the text, but before we vectorize it we first need to tokenize and normalize it, as sketched below.
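A minimal sketch of that tokenization and normalization step, building on the clean_text column above and using NLTK's PorterStemmer (the choice of stemmer is my assumption, not necessarily what the original code used):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def normalize(tweet):
        # Splitting on whitespace also collapses the unbalanced spaces left behind by cleaning
        tokens = tweet.split()
        return " ".join(stemmer.stem(token) for token in tokens)

    train["clean_text"] = train["clean_text"].apply(normalize)
    test["clean_text"] = test["clean_text"].apply(normalize)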

Screen print showing the extraction of hashtags from tweets.
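That extraction can be sketched as a simple regular expression over the raw text column (the extract_hashtags helper name is illustrative):

    import re

    def extract_hashtags(tweet):
        # Return every word that follows a '#' in the tweet
        return re.findall(r"#(\w+)", tweet)

    train["hashtags"] = train["text"].apply(extract_hashtags)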

Once the text in the tweets has been normalized we can visualize it and see the common hashtags for real and fake disasters. We use wordcloud, a Python library for visualizing text.

For real disasters the following words are common,

and similarly for fake disasters we have the following words.
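A hedged sketch of how these word clouds can be generated with the wordcloud library, using the cleaned text and the target column from train.csv:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    real_text = " ".join(train.loc[train["target"] == 1, "clean_text"])
    fake_text = " ".join(train.loc[train["target"] == 0, "clean_text"])

    for title, text in [("Real disaster", real_text), ("Fake disaster", fake_text)]:
        cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
        plt.figure(figsize=(10, 5))
        plt.imshow(cloud, interpolation="bilinear")
        plt.axis("off")
        plt.title(title)
        plt.show()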

Word Embedding

Word embedding is the process of converting text into a vector of numerical values. There are different ways in which we can define this vector; in this blog we are going to use bag of words, but it is worth noting that other methods exist.

The screen print below shows the process of performing bag-of-words word embedding.
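As a sketch, the same embedding can be built with scikit-learn's CountVectorizer; the parameter choices here are mine rather than the exact values from the original code:

    from sklearn.feature_extraction.text import CountVectorizer

    # Learn the vocabulary on the training tweets and reuse it for the test tweets
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    X_train = vectorizer.fit_transform(train["clean_text"])
    X_test = vectorizer.transform(test["clean_text"])
    y_train = train["target"]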

Now that we have cleaned, normalized and embedded the text, we can move on to feeding it to a machine learning algorithm. We are now in a position to build a model and start classifying tweets as real or fake disasters.

Train the model

There are a number of algorithms we could use to train our model, but in this tutorial we are going to use XGBoost. XGBoost stands for eXtreme Gradient Boosting and is an implementation of gradient boosted decision trees designed for speed and performance. It performs better than most predictive models, and it is for this reason that we are going to use it to classify our tweets.

The code implementation is shown below.
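What follows is only a sketch of that implementation; the hyperparameters are illustrative rather than the exact values used in the original notebook:

    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Score the model on the training data, as discussed below
    train_pred = model.predict(X_train)
    print("Training accuracy:", accuracy_score(y_train, train_pred))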

We get a score of 73.46%, which is not bad for a first attempt. We could improve the score by optimizing the algorithm, for example through hyperparameter tuning, but in this blog we will leave the score as is.

The above score is based solely on the training data; we are more interested in how the model performs on unseen data. If we apply our model to the test data as shown below, we get a score of 76%, which is better than the training result.
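The competition's test set does not come with labels, so the score on unseen data is presumably obtained by submitting predictions to Kaggle. A sketch of producing such a submission file, using the id and target columns from the data description:

    import pandas as pd

    # Predict on the unseen test tweets and write a Kaggle-style submission file
    test_pred = model.predict(X_test)
    submission = pd.DataFrame({"id": test["id"], "target": test_pred})
    submission.to_csv("submission.csv", index=False)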

Testing the model on COVID-19 tweets

In this section we are going to test our model on COVID-19 tweets and analyze the sentiment; we just want to see how people are reacting to the virus on Twitter. The process of cleaning the tweets is the same as above. The COVID-19 tweets can be found at the following link: http://www.apsense.com/article/top-20-historical-twitter-datasets-available-for-download.html.

We are only going to classify the first 10 tweets.
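A sketch of running the trained model over those first 10 tweets, assuming they have been loaded into a DataFrame with a text column and cleaned with the same helpers as above (the file name is illustrative):

    import pandas as pd

    covid = pd.read_csv("covid_tweets.csv").head(10)   # illustrative file name
    covid_clean = covid["text"].apply(clean_text).apply(normalize)

    covid_pred = model.predict(vectorizer.transform(covid_clean))
    for tweet, label in zip(covid["text"], covid_pred):
        print(label, "-", tweet[:80])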

Our model classified most of the tweets as fake disasters. This makes sense, since the model has only been trained on actual disaster scenarios; it recognises keywords such as floods or volcano as signs of a real disaster.

Conclusion

We have shown how you can use NLP to analyse sentiment for tweets. In this blog we only used XGBoost to classify the tweets; in future we could apply more advanced techniques such as transformers in TensorFlow and word embeddings using BERT. We have shown how easy it is to classify tweets into real or fake disasters. This whole process only took a week to complete.

It is also important to note that the modeling part was easy to complete; however, the cleaning process required a lot of time and effort. NLP problems are known for this: the data is always unstructured and filled with unwanted patterns.

References

  1. http://www.apsense.com/article/top-20-historical-twitter-datasets-available-for-download.html
  2. https://www.kaggle.com/c/nlp-getting-started


Blessing Magabane

A full stack Data Scientist with experience in data engineering and business intelligence.