Sentiment Analysis – understanding the source of the controversy
The accuracy of sentiment analysis has been a hot topic in social content analysis, fuelled by the inflated expectations created by overzealous marketing from social media monitoring tools. But long before people were sharing pictures of their latest meal on Twitter (as far back as the 1980s), academics and businesses were interested in analysing the large volumes of textual data available to them. Of the many sub-fields of text mining, sentiment analysis has proved one of the more challenging to implement on computers due to the complexity of human language. Indeed, one definition of language is "… a system of communication that enables humans to exchange verbal or symbolic utterances." Computers find the latter part, "symbolic utterances", challenging because it builds on the implicit common knowledge that a person develops over a lifetime of exposure to symbolism and use of the language.
In this post I aim to provide a sense of why sentiment analysis is inaccurate, why the results differ between tools, and what this means for you.
Sentiment Analysis (Opinion Mining) – a computer perspective
Of the two main definitions of "sentiment", the one we are interested in is "a view or opinion that is held or expressed", as opposed to "exaggerated and self-indulgent feelings of tenderness, sadness, or nostalgia". Hence, the process is more accurately named Opinion Analysis/Mining.
Computers see documents as long lists of characters (including spaces, punctuation, etc.). This long list of characters has no meaning to a computer without a human-imposed structure to interpret it – an important point that is obvious but easily forgotten. The two main families of algorithms that have been used to help computers extract meaning from text are based on keyword dictionaries (KD) or on human-trained supervised learning for statistical pattern detection (HT). The two areas are not totally disjoint, but the distinction helps with the explanation.
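To make this concrete, here is a minimal sketch (in Python, with an invented example sentence) of the gap between what the machine stores and the structure we impose on it: the raw data is just characters, and even something as basic as "words" only appears after we apply a human-designed rule such as splitting on spaces.

```python
# A document is just a sequence of characters until we impose structure on it.
doc = "This movie is disastrously funny!"

chars = list(doc)  # how the machine "sees" the text: raw characters
# A human-designed rule: lowercase, drop trailing punctuation, split on spaces.
tokens = doc.lower().strip("!").split()

print(chars[:5])  # ['T', 'h', 'i', 's', ' ']
print(tokens)     # ['this', 'movie', 'is', 'disastrously', 'funny']
```

Everything beyond this – deciding which of those tokens carry opinion, and how they combine – is where the two families of algorithms diverge.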
KD based automated analysis
KD algorithms rely on predefined lists of word–sentiment pairs that tell the computer whether a word is negative, positive, or neutral, and to what degree. For example, "happy" can be +1 while "disastrous" can be -3 and "are" can be 0. The simpler algorithms, the ones implemented in many social media sentiment analysis tools, sum up the values in a sentence to come up with an aggregate score that is positive, negative, or neutral. This approach is very basic and can't capture complex expressions like "this movie is disastrously funny". Overall, the results are inaccurate and the output is not very informative. Counter-intuitively, academic research has shown that trying to manually improve KD dictionaries to get a more accurate opinion assessment gives even less accurate results. The reason turned out to be that we use signals of sentiment that are different from what we think we use. For example, the "?" and "Still" in the sentence "What was the director thinking? Still, though, it was worth seeing." are some of the indicators of tone of voice for this sentence; however, in the research, these signals were not chosen by the analysts when manually selecting words that indicate sentiment.
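A minimal sketch of such a summing KD scorer, using a tiny invented dictionary, shows both the mechanism and its failure mode on the "disastrously funny" example from above:

```python
# Hypothetical miniature keyword dictionary (word -> sentiment weight).
# Real tools use lexicons with thousands of entries.
LEXICON = {"happy": 1, "funny": 2, "disastrous": -3, "disastrously": -3, "are": 0}

def kd_score(sentence):
    """Sum the dictionary weights of every known word; unknown words score 0."""
    words = sentence.lower().replace("?", "").replace(".", "").split()
    return sum(LEXICON.get(w, 0) for w in words)

print(kd_score("The staff are happy"))               # 1  -> labelled positive
print(kd_score("this movie is disastrously funny"))  # -1 -> wrongly labelled negative
```

The second sentence is a compliment, but because the negative weight of "disastrously" outweighs "funny", naive summation labels it negative – exactly the kind of complex expression these simple scorers cannot capture.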
More advanced algorithms build on detailed domain and topic knowledge to customise the dictionaries. These algorithms provide a more nuanced analysis of sentiment – detecting anger, frustration, support, advocacy, etc. – but they still suffer from the limitations of the dictionaries they rely on. The most accurate ones use domain- and case-specific dictionaries generated by researchers using complex selection criteria.
The most advanced algorithms, such as the Recursive Deep Learning for Sentiment Analysis from Stanford University, rely on complex (deep) sentence structure analysis (shown in the figure below) and still depend on human training to "learn" better estimations of sentiment. Very few tools implement such algorithms at scale.
The best KD algorithms aim to achieve 80% accuracy. The actual accuracy is highly dependent on the target text and can be much lower for more complex documents. It is also dependent on having a dictionary for the language of the text (e.g. Mandarin Chinese, Arabic, Hebrew, …).
HT based automated analysis
The second family of algorithmic text analysis relies on humans to tell the algorithms how to label a group of words/sentences. The algorithms look for what is similar/different (statistically) within each group of trained examples, and across the set of groups, to infer a pattern. This pattern is then used to label any new content the algorithm receives. Historically, this approach has been used to cluster documents based on "factual information", e.g. labelling documents as "sports article" or "political article". More recently the same approach has been used to cluster opinion (sentiment), e.g. "happy about the product". The main benefit of HT algorithms is illustrated in the example below.
The figure above provides a helpful example to illustrate the difference between the two families of algorithms. Your brain takes a few seconds to learn the meaning of the symbols in the figure, and then you can read and understand the content relatively easily. A KD algorithm that was not adapted to this content will see this example as gibberish. An HT algorithm, by contrast, relies on your brain to interpret the content and tell the algorithm what it means (assuming no implementation-specific limitations): you can label this example as "amazement-positive", and the algorithm can then go and look for this pattern to label the rest of the content you present to it.
Even though this is a tongue-in-cheek example, it helps contrast the difference. HT algorithms rely on us making sense of the content, labelling it as we wish, and letting the computers do what they are good at – number crunching. This approach delivers much higher accuracy, up to 97%, but requires a time investment to choose the best possible examples to train the algorithm. The accuracy is also highly dependent on the quality of the training (and hence the person doing it) and of the training set; if new content differs significantly from the training set, the accuracy is lost.
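To give a sense of the mechanics, here is a toy sketch of the HT approach: a human supplies a handful of labelled examples (the training data below is invented for illustration), and a simple statistical model – a naive Bayes classifier, one of many techniques used in practice – learns word patterns per label and applies them to new text.

```python
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: list of (text, label) pairs supplied by a human trainer."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label whose learned word pattern best matches the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing: unseen words don't zero out the score.
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical human-labelled training set.
training = [
    ("great movie loved it", "positive"),
    ("what a fantastic fun film", "positive"),
    ("terrible plot waste of time", "negative"),
    ("boring and badly acted", "negative"),
]
model = train(training)
print(classify("loved this fun film", model))   # positive
print(classify("boring waste of time", model))  # negative
```

Note that the labels ("positive", "negative", or anything else, such as "amazement-positive") are entirely up to the human trainer; the computer only does the counting. This is also where the fragility comes from: if the conversation drifts to vocabulary absent from the training set, the learned patterns stop matching.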
Fully automated vs HT
From a computing perspective, KD algorithms are "smarter" in that the computer does more of the analysis to identify sentiment, while HT algorithms depend much more on the human trainer. From a results perspective, KD algorithms are limited by how accurately the computer can interpret human language, while HT algorithms provide more flexibility and more accuracy at the cost of more setup effort.
From a business perspective, comparing the accuracy of the two classes of text analysis algorithms is less useful than knowing where to use each one. KD algorithms are useful if you want to capture highly opinionated posts in real time to identify potential problems, while keeping in mind that the aggregate totals are not reflective of actual customer opinion. HT algorithms, when well trained and focused on a business objective, are useful for assessing a large set of content for detailed sentiment (opinion) clustering and can, therefore, be an accurate representation of the opinion in the dataset. At the same time, HT algorithms are less useful for ongoing monitoring, since accuracy drops as the conversation drifts away from the patterns of the initial training set, and frequent interpretation and re-training are required.
What does this mean for you
I’ll start with the premise that, as business users, very few of us are interested in social media for its own sake; rather, we are interested in what business value we can extract from it through analysing opinion, people, and networks and, yes, monitoring mentions for PR purposes.
The important takeaway is that no single currently available approach or tool provides a complete solution to address these needs. Not only that, but placing social media content analysis in a silo removes the opportunity to link the analysis to business objectives and needs. Without an understanding of the business, we get colourful dashboards that do not close the decision gap between what we want to know and what information we currently get. Until Artificial Intelligence achieves the level required to learn human knowledge and use it to understand context and business (it’s not far off, based on what IBM’s Watson is doing), we still need to rely on humans, and not computers, to understand the objectives and extract meaning from the available data. We need to approach sentiment (opinion) analysis as we would any other data analysis project: by understanding the business, defining the needed outputs, and then selecting the tools that help us deliver and interpret those outputs. It’s easy to buy tools, but it’s much more difficult to find the expert who can extract value from any dataset – not only social media content.