Opinion mining the Twitter way
In a previous article, I mentioned how automated opinion mining (sentiment analysis) can take one of two approaches: a dictionary-based or linguistic-processing approach, which analyses the language of the content, or a statistical, knowledge-poor but data-rich approach. In this article, I will summarise how Twitter implemented this analysis while having to take into account the scale of its production deployment. As a side note, I have added a table of technical terms at the end of the article.
Sentiment analysis as a classification problem
In its simplest form, sentiment analysis can be defined as classifying content polarity: the probability that a piece of content falls into one of two buckets, or anywhere on the spectrum whose two ends represent 100% probability of class membership. I deliberately say class rather than positive or negative sentiment to emphasise that sentiment is one type of classification, but not the only type. The classes can be any two labels: Cold/Hot, Small/Big, Topic1/Topic2. In addition, using two classes does not limit generality, as multi-class classification can be decomposed into multiple binary classifications executed in sequence.
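To illustrate the reduction of a multi-class problem to binary ones, here is a minimal sketch, with classifiers and data entirely of my own invention (not from the paper), that combines per-class binary classifiers in a one-vs-rest fashion:

```python
# Reduce a multi-class decision to a set of binary ones (one-vs-rest).
# Each binary "classifier" is a stand-in: a function returning the
# probability that a document belongs to its class.

def classify_multiclass(doc, binary_classifiers):
    """Run one binary classifier per class; pick the most confident class."""
    scores = {label: clf(doc) for label, clf in binary_classifiers.items()}
    return max(scores, key=scores.get)

# Toy stand-in classifiers (hypothetical, for illustration only).
classifiers = {
    "sports":   lambda doc: 0.9 if "match" in doc else 0.1,
    "politics": lambda doc: 0.8 if "vote" in doc else 0.2,
}

print(classify_multiclass("the match was close", classifiers))  # sports
```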
Back to the topic: the objective is then, “given a set of tweets that are known in advance to express some opinion, use this data to predict the opinion of unlabeled tweets”. This objective falls under what is termed supervised machine learning. As a shortcut to manually labeling tweets, Twitter used emoticons to infer the sentiment of a tweet, which was then used as that tweet’s label. This is a useful trick when processing millions of tweets, keeping in mind the bias this process introduces. To get a sense of the volume of data in this experiment, Twitter used a test set of 1 million tweets and training sets of 1 million, 10 million, and 100 million tweets.
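The emoticon labelling trick can be sketched as follows. The emoticon lists and the rule for ambiguous tweets are my own illustrative assumptions, not the sets Twitter actually used:

```python
# Distant supervision: use emoticons as noisy sentiment labels, then strip
# them from the text so the model cannot simply learn the emoticon itself.

POSITIVE = {":)", ":-)", ":D"}    # illustrative emoticon sets, not Twitter's
NEGATIVE = {":(", ":-(", ":'("}

def emoticon_label(tweet):
    """Return (cleaned_tweet, label), or None if no or ambiguous emoticon."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:          # no signal, or contradictory signals
        return None
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return cleaned, ("positive" if has_pos else "negative")

print(emoticon_label("great game :)"))   # ('great game', 'positive')
```

Discarding tweets with contradictory emoticons keeps the labels cleaner, but the bias mentioned above remains: the model only ever sees tweets whose authors chose to use emoticons.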
The classification model was based on a simple logistic regression classifier (trained with stochastic gradient descent) applied to a feature set of hashed byte 4-grams. To underline the point that language was not considered, the feature extraction worked as follows: each tweet was treated as a raw string of bytes, and a sliding window extracted every sequence of four bytes; words were not even tokenized. Each four-byte sequence was hashed, and the hash was used as a feature id. The feature values were binary yes/no, so multiple occurrences were counted once. What is interesting is that the model does not need to know what the language is; indeed, filtering on English tweets was only for convenience.
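As a rough sketch of this feature extraction, assuming a hash function and feature-space size of my own choosing (the paper’s exact settings may differ):

```python
# Hashed byte 4-gram features: slide a 4-byte window over the raw bytes of
# the tweet, hash each window into a fixed-size feature space, and record
# binary presence (repeated 4-grams count once).
import hashlib

NUM_FEATURES = 2 ** 20  # assumed feature-space size, not the paper's value

def byte_4gram_features(text):
    data = text.encode("utf-8")
    features = set()                      # binary presence only
    for i in range(len(data) - 3):
        window = data[i:i + 4]            # sliding 4-byte window
        h = int.from_bytes(hashlib.md5(window).digest()[:4], "big")
        features.add(h % NUM_FEATURES)
    return features
```

Note there is no tokenizer, stemmer, or vocabulary anywhere: the feature space is defined purely by hashing raw bytes, which is why the approach is language-agnostic.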
During the experiment, Twitter applied two different learning modes:
- a model based on a single logistic regression learner
- an ensemble model based on multiple logistic regressions in parallel
and accuracy was measured by evaluating the precision and recall of each case. What is interesting is that, without any knowledge of the content, the results improved as more data was used in training, reaching an accuracy of 82% with an ensemble of 21 classifiers. Not only that, but even with a single classifier the system achieved 79% accuracy, better than most social media analytics tools currently on offer.
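The two learning modes can be sketched as follows: a single logistic regression learner updated by stochastic gradient descent, and an ensemble of such learners combined by voting. The learning rate, toy data, and majority-vote rule are my own illustrative assumptions:

```python
# Minimal SGD logistic regression over sparse binary features, plus a
# majority-vote ensemble of several independently trained learners.
import math, random

class SGDLogistic:
    def __init__(self, num_features, lr=0.1):
        self.w = [0.0] * num_features
        self.lr = lr

    def predict_proba(self, features):        # features: set of active ids
        z = sum(self.w[f] for f in features)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):        # label: 0 or 1
        error = label - self.predict_proba(features)
        for f in features:                    # sparse gradient step
            self.w[f] += self.lr * error

def ensemble_predict(models, features):
    votes = sum(m.predict_proba(features) > 0.5 for m in models)
    return 1 if votes > len(models) / 2 else 0

# Train a small ensemble, each member on its own shuffle of toy data.
data = [({0, 1}, 1), ({2, 3}, 0)] * 50
models = []
for seed in range(3):
    random.seed(seed)
    m = SGDLogistic(num_features=4)
    shuffled = data[:]
    random.shuffle(shuffled)
    for feats, y in shuffled:
        m.update(feats, y)
    models.append(m)

print(ensemble_predict(models, {0, 1}))  # 1
```

Because each SGD update touches only the features active in one example, the learner never needs the whole dataset in memory, which is exactly the property the next section relies on.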
Analytics in a big data context
The results are impressive, but the real challenge for Twitter was not the sentiment classification itself but how to perform this analysis at scale in a big data production environment. While the assumption so far has been that big data will solve the complexity of analytics at scale, what this shows is that, in fact, the analytics have to be adapted to the big data system to work at scale. This adaptation affects both the choice of algorithm and the way the algorithm integrates with the existing system.
The algorithm had to be able to work with partial data, or data in chunks, and preferably with streams of data as they become available; hence the choice of a logistic regression learner with a stochastic gradient descent optimiser. This is of particular relevance because, in big data systems, transferring data between different processors is expensive and slow. As for the “big data” part, Twitter relied on Pig/Hadoop for its business analytics and wanted to embed the predictive analytics into the existing system. Pig by design is not very amenable to machine learning algorithms, and the ingenious part of what Twitter did was to use what Pig excels at, large-scale data processing, while adding the sentiment classification as custom blocks within the Pig pipeline so that these custom blocks form the core of the process.

Two elements of the workflow are particularly interesting. By selecting models that can work with partial data, the learning models did not have to look at all the data at once, which makes integrating them into the process easier. Equally important, by using HDFS to store the intermediate model parameters, the sentiment classification was, again, easier to integrate into the workflow. The actual model implementations were based on machine learning Java libraries developed internally at Twitter. The paper goes into the details of the implementation in case you would like to replicate the system.
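The chunked-training property can be sketched as follows: an SGD learner’s parameters can be saved after each chunk and reloaded before the next, which is, very loosely, the role HDFS plays for intermediate model state in Twitter’s pipeline. The local JSON file and all parameter choices here are my own illustrative stand-ins:

```python
# Training on data chunks with parameter checkpointing between chunks,
# mimicking locally the role of a shared store (HDFS) for model state.
import json, math, os, tempfile

def sgd_step(w, features, label, lr=0.1):
    """One sparse SGD update on a dict of feature-id -> weight."""
    z = sum(w.get(str(f), 0.0) for f in features)
    p = 1.0 / (1.0 + math.exp(-z))
    for f in features:
        w[str(f)] = w.get(str(f), 0.0) + lr * (label - p)
    return w

checkpoint = os.path.join(tempfile.gettempdir(), "model_params.json")
if os.path.exists(checkpoint):                  # start from a clean state
    os.remove(checkpoint)

chunks = [[({0, 1}, 1)] * 20, [({2, 3}, 0)] * 20]   # data arrives in chunks

w = {}
for chunk in chunks:
    if os.path.exists(checkpoint):              # resume from saved state
        with open(checkpoint) as fh:
            w = json.load(fh)
    for features, label in chunk:
        w = sgd_step(w, features, label)
    with open(checkpoint, "w") as fh:           # persist after each chunk
        json.dump(w, fh)
```

The key point is that no step ever needs the full dataset: each chunk updates the persisted parameters and moves on, which is what makes this style of learner easy to drop into a batch pipeline.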
The challenge is not in the technology
Despite all the technical complexity and detailed solutions, it was reaffirming to read from Twitter that the creative part is not in the big data technology but in the formulation of the problem. Technology does not decide why, or which, questions to ask of the data, and there is no substitute for domain understanding when generating insight.
At the same time, broader adoption of such analysis is limited not by technology but by the readiness of people to embrace truly data-driven decision making; the cultural challenge is orders of magnitude bigger than the technical one.