An introduction to sentiment analysis


The general idea behind sentiment analysis in finance is that textual data contains relevant but hard-to-quantify information in addition to objective statements of fact. This information could influence the sentiment of market participants, which in turn could influence their actions. Quantifying sentiment is therefore important to the extent that it is informative about future events. Sentiment indexes have the potential to increase market efficiency: incorporating sentiment into risk management tools or trading workflows should help to identify and prevent capital misallocation. Here we go through the basic steps of creating a sentiment index.

The first step is to decide on the scope and source of inputs. The most general sentiment index would try to gauge the sentiment of the entire population over all possible topics combined, which means applying little or no filtering to the input source. By increasing the filtering and tracking specific keywords or sources, we move from general sentiment to industry-specific sentiment and finally to company- or product-level sentiment.
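
As a rough illustration of how filtering narrows the scope, the sketch below keeps only the documents that mention at least one tracked keyword. The function name filter_documents, the sample documents, and the keywords (including the company "Acme Corp") are all made up for this example; a real pipeline would also filter by source and handle synonyms, tickers, and so on.

```python
import re


def filter_documents(documents, keywords=None):
    """Keep documents that mention at least one keyword.

    With no keywords, nothing is filtered out (the "general" index).
    """
    if not keywords:
        return list(documents)
    # Whole-word, case-insensitive match on any of the keywords.
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


# Hypothetical input documents, for illustration only.
documents = [
    "Markets rallied today on strong earnings.",
    "Oil producers cut output again.",
    "Acme Corp unveils a new phone.",
    "Great weather for the weekend.",
]

general = filter_documents(documents)                     # no filtering
industry = filter_documents(documents, ["oil", "energy"])  # industry level
company = filter_documents(documents, ["Acme Corp"])       # company level
```

The same sentiment scoring can then be applied to each of the three subsets to obtain a general, an industry-level, and a company-level index.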

The next step is to decide on the classification methodology. The options fall into two broad categories: using a lexicon, or training your own model to identify sentiment. A lexicon is a set of words with a sentiment score attached to each; a simple lexicon could be {good:1, bad:-1}. To apply such a lexicon, the document is scanned for the words “good” and “bad”, and the scores of every occurrence are summed. If the final score is positive, the document’s sentiment is positive; if it is negative, the sentiment is negative. The drawback of expert-defined lexicons is that they might have been built for a different domain. For example, the Profile of Mood States is designed to gauge the psychological state of a person, not their outlook on financial markets. In addition, what is considered positive in one domain could be negative in another. This can be overcome by creating your own lexicon. One of the simplest ways is to start with a list of seed words and then expand it, either through a thesaurus or by scanning the dataset for the words that most frequently appear together with the seeds.
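
The lexicon approach described above can be sketched in a few lines. This is a minimal example, not a production scorer: the tokenizer is deliberately naive, the "neutral" label for a score of zero is an assumption (the text only covers positive and negative totals), and expand_seeds is a hypothetical helper that ranks words by how many documents they share with a seed word.

```python
import re
from collections import Counter

# The toy lexicon from the text.
LEXICON = {"good": 1, "bad": -1}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=LEXICON):
    """Sum the scores of every lexicon word that occurs in the text."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def classify(text, lexicon=LEXICON):
    """Map the summed score to a label; zero is treated as neutral (assumption)."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def expand_seeds(seeds, documents, top_n=1):
    """Return the words that most often share a document with a seed word.

    In practice common stopwords would need to be removed first,
    otherwise they dominate the co-occurrence counts.
    """
    seeds = set(seeds)
    counts = Counter()
    for doc in documents:
        tokens = set(tokenize(doc))
        if tokens & seeds:
            counts.update(tokens - seeds)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical corpus, for illustration only.
corpus = ["good strong results", "good strong outlook", "bad weak results"]
candidates = expand_seeds({"good"}, corpus)
```

For example, classify("Good quarter, good guidance, bad weather") sums 1 + 1 - 1 = 1 and returns "positive". Candidates returned by expand_seeds would still need a score assigned, for instance by inheriting the score of the seed they co-occur with.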

Training your own sentiment model usually requires a labeled dataset. Some datasets are labeled by hand by researchers, but this is time-consuming given that a dataset could contain millions of documents (e.g. Twitter messages), so automatic labeling is frequently used. Twitter messages can be labeled according to the presence of smiley faces or specific hashtags, and company reports according to whether the price rises or falls after the report is made public. In rare cases, the authors of the documents provide the label themselves. Once the labels are acquired, we have a traditional supervised machine learning task at hand. Naive Bayes classifiers, Support Vector Machines, and Decision Trees are among the methods that have been used successfully for this task.
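
To make the automatic labeling and training steps concrete, here is a small self-contained sketch: messages are labeled by the emoticons they contain, and a multinomial Naive Bayes classifier with add-one smoothing is trained on the result. The emoticon lists, the sample messages, and the names auto_label and NaiveBayes are assumptions made for this example; in practice an established machine learning library would normally replace the hand-written classifier.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed emoticon sets; real studies use longer lists and hashtags.
POSITIVE_MARKS = (":)", ":-)")
NEGATIVE_MARKS = (":(", ":-(")


def auto_label(message):
    """Return 'pos' or 'neg', or None if emoticons are absent or mixed."""
    has_pos = any(mark in message for mark in POSITIVE_MARKS)
    has_neg = any(mark in message for mark in NEGATIVE_MARKS)
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"


def tokenize(text):
    """Keep word tokens only, so the emoticons used for labeling
    do not leak into the features."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            logp = math.log(n_docs / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                logp += math.log((count + 1) / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical messages, for illustration only.
messages = [
    "great earnings, stock looks strong :)",
    "strong growth and great outlook :-)",
    "terrible results, weak guidance :(",
    "weak demand, terrible quarter :-(",
    "meeting scheduled for Monday",
]
labeled = [(m, auto_label(m)) for m in messages if auto_label(m) is not None]
model = NaiveBayes().fit([m for m, _ in labeled], [y for _, y in labeled])
```

The unlabeled fifth message is simply dropped from training. Once fitted, model.predict can be applied to messages that carry no emoticon at all, which is the point of training a classifier rather than relying on the labeling rule.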

Whether the lexicon approach or a machine learning classifier is chosen, the final step is to evaluate the usefulness of the results. Here again there are plenty of methods to choose from, ranging from relatively simple correlations with financial indexes to complex rule-based trading strategies.
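
At the simple end of that range, one possible check is the Pearson correlation between the sentiment index and the returns of a financial index some days later. The sketch below assumes two equally spaced daily series of the same length; the function names, the lag of one day, and the numbers are invented for illustration and are not results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(sentiment, returns, lag=1):
    """Correlate sentiment on day t with the return on day t + lag."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return pearson(sentiment[:-lag], returns[lag:])


# Made-up daily series: here each return is exactly 0.1 times the
# previous day's sentiment, so the lagged correlation is perfect.
sentiment = [0.1, -0.2, 0.3, 0.0, -0.1]
returns = [0.0, 0.01, -0.02, 0.03, 0.0]
corr = lagged_correlation(sentiment, returns, lag=1)
```

A lag of at least one day matters here: a same-day correlation could simply reflect sentiment reacting to price moves, whereas the stated goal is for sentiment to be informative about future events.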

Rytis Simanaitis is based at The University of Manchester (2016-2019), and his research project is Identify Financial Mood Market Indexes (WP4).