Large, Unstructured, Noisy Data in Finance


Large, unstructured, and noisy textual data sources have become readily available as the web has grown and become a household commodity. Millions of product reviews, resumes, legal and corporate filings, blogs, news articles, governmental releases, (and more) may be freely downloaded and offer tremendous opportunities for unlocking new business value, enabling more effective communities, and supporting more accurate decision making.

Unfortunately, large, unstructured, and noisy textual data sources have three primary disadvantages that have made them difficult to work with and have hindered their widespread adoption:

They are LARGE,
they are UNSTRUCTURED,
and they are NOISY.

Certainly, with the advent of widespread and relatively affordable “elastic cloud” computing grids in the past decade, the problem of scale is no longer insurmountable, although some types of tasks, especially those that cannot be easily parallelized and that require large amounts of fast local memory (e.g., graph building, complex dependent structures), can still pose challenges.

Our main focus is on the latter two problems: lack of structure, and noisiness. The former stems from the generative nature of language, which can produce an endless set of variations when referring to substantially equivalent concrete concepts. The latter stems from the real-world consideration that most information (especially text) is still generated by fundamentally flawed and error-prone machines: human beings. People lack the coordination to settle on a single referent for each concept or relation, and thus generate noisy data, sometimes missing key pieces of the puzzle, making mistakes, or purposefully misleading the end consumer.

As these data sources become more and more commonplace, the need for a consistent and reliable set of methodologies and recommendations has also grown. Industries from publishing to finance, recruiting to sales and marketing, and even sophisticated governments are unable to make the most of the vast amounts of unstructured and semi-structured textual data they collect daily.

We ground our research in the realm of recruiting and the economics of the global labour force (commercial finance). We are interested in understanding the internal hierarchies of individual firms, how people transition from role to role, the skills that they acquire, and how intra-firm workforce movements can affect the financial performance of firms, innovation, industry-wide dynamics, and the economy more broadly. We would also like to better understand when and why people are likely to seek new employment, what the barriers are, and which skills are most in demand. In the past, such explorations have depended largely on panel data from government surveys (usually restricted to particular industries or companies), such as \cite{gibbons1991layoffs}, or on internal Human Resources data from a single large firm \cite{baker1994internal}.

We take a different approach, and instead use hundreds of millions of people’s employment histories from resumes and web-based employment profiles. However, each resume is formatted differently, uses different conventions, and may be in any of dozens of languages. Even the semi-structured employment profiles we have access to were manually created and contain very noisy data, with the potential for many missing or erroneous entries. For instance, every individual is at liberty to write their company name, title, department, location, and skills in whatever way they like best. We are left with a vast number of short strings with little or no context.
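To make the surface-variation problem concrete, here is a minimal normalization sketch in Python. The variant strings, the list of legal suffixes, and the function name are our own illustrative assumptions, not part of any production pipeline; real resume data requires far broader coverage than this.

```python
import re

# Hypothetical surface variants of the same employer, as they might
# appear in free-text profile fields.
VARIANTS = ["Google Inc.", "google", "Google, Inc", "GOOGLE INC"]

# A tiny, illustrative list of legal suffixes; a real list is much longer.
LEGAL_SUFFIXES = re.compile(r"\b(inc|llc|ltd|corp|co)\b\.?", re.IGNORECASE)

def normalize_company(name: str) -> str:
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    name = LEGAL_SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^\w\s]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

# All four variants collapse to a single canonical form.
print({normalize_company(v) for v in VARIANTS})  # {'google'}
```

Rule-based normalization like this handles formatting variants but not typos or translations, which is part of why the disambiguation problem remains hard at scale.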

To add insult to injury, the concepts that we are interested in working with (companies, roles, skills, etc.) do not have readily available reference taxonomies that we can link to, and they contain structure that makes these problems quite different from the entity resolution problems in the literature. When someone lists their role as “Sr. EA to the VP of Marketing (N.A.)”, we want to understand that their role is Executive Assistant, with a seniority marker indicating a more senior role than a plain Executive Assistant, and that their boss is a Vice President in the firm’s Marketing department, with responsibilities for North American operations. That is a lot of information, and it represents one of tens of thousands of different representation choices.
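A first step toward recovering that structure is expanding the abbreviations in the raw title. The sketch below shows the idea with a deliberately tiny lookup table; the table entries and function name are illustrative assumptions, and real coverage would need tens of thousands of entries plus actual parsing of the seniority and reporting structure, not just token expansion.

```python
# Illustrative abbreviation table (hypothetical; far from exhaustive).
ABBREVIATIONS = {
    "sr.": "Senior",
    "ea": "Executive Assistant",
    "vp": "Vice President",
    "n.a.": "North America",
}

def expand_title(raw: str) -> str:
    """Expand known abbreviations token by token, leaving other words intact."""
    tokens = raw.replace("(", " ").replace(")", " ").split()
    return " ".join(ABBREVIATIONS.get(t.lower(), t) for t in tokens)

print(expand_title("Sr. EA to the VP of Marketing (N.A.)"))
# Senior Executive Assistant to the Vice President of Marketing North America
```

Even after expansion, the relational reading (who reports to whom, which department, which region) still has to be inferred, which is where these strings diverge from standard entity resolution inputs.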

Ultimately, to study the dynamics of movements in the workforce at the level of an entire economy, we need to move from individual job records to the massive intertwined graph of job transitions, and we need to use the large number of observed job transitions to infer the correct internal structure of each firm. In order to build such a graph, we need clean, disambiguated data. Conversely, in order to get clean, disambiguated data, we could really make use of distributions of roles and employment over the entire graph. This suggests an opportunity to iteratively refine both sides of the equation, and jointly learn how to solve these two problems optimally.
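The iterative refinement idea can be sketched with a toy example: each raw string has a set of candidate canonical entities, and assignments are repeatedly recomputed using frequencies aggregated over the current assignment of all records. The data, candidate generator, and loop below are hypothetical stand-ins for the real pipeline, which operates over a full transition graph rather than simple counts.

```python
from collections import Counter

# Toy raw company strings as they appear across many profiles (hypothetical).
records = ["Google Inc", "Google", "Googel", "Google", "Google Inc"]

# Candidate canonical forms per raw string (hypothetical candidate generator).
candidates = {
    "Google Inc": ["Google"],
    "Google": ["Google"],
    "Googel": ["Google", "Googel"],  # typo: its own entity, or a variant?
}

# Pessimistic start: prefer treating each string as its own entity.
assignment = {raw: cands[-1] for raw, cands in candidates.items()}

# Iteratively re-assign each raw string to its most frequent candidate,
# recomputing frequencies from the current global assignment.
for _ in range(3):
    counts = Counter(assignment[r] for r in records)
    assignment = {raw: max(cands, key=lambda c: counts.get(c, 0))
                  for raw, cands in candidates.items()}

print(assignment["Googel"])  # Google
```

Aggregate statistics resolve the typo that no single record could: “Googel” is absorbed into the dominant entity because the population-level distribution makes the alternative implausible.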

You can read more about our research into global labor market dynamics in the paper “Trading on Talent: Human Capital and Firm Performance”.