A Brief Discussion About Utilizing High-Frequency News Data In Finance


Over the last two decades, considerable effort has gone into developing tools for analyzing high-frequency financial data (e.g., transactions and quotes, or TAQ data). The outcome includes a series of econometric models that enable us to analyze the microstructure of the market, to estimate and forecast volatility on a high-frequency basis, to scrutinize relations among the intensities of distinct activities, and so forth. All of these enrich our knowledge of the market and benefit practitioners, who are primarily concerned with ever-better pricing and risk management.

We have also witnessed a growing number of models that assimilate news data and are directed toward financial applications. Yet in contrast to the rapid development of tools based on high-frequency financial data, little attention has been paid to building models in which high-frequency news data are exploited directly. Most models are based on news data aggregated over a fixed time window, which often spans more than a day and inevitably makes us overlook some microstructure traits of the data. To better understand how the press and the market interact, and what benefits we can derive from news data in financial applications, it is important to develop models that accommodate both news data and financial data in a high-frequency setting.

This is not a trivial task. The very first feature we notice when glancing at high-frequency data is that events occur at irregularly spaced times. The time elapsed between two news items released by the media can be of any length, and so can the time elapsed between adjacent financial events (such as transactions and quotes). This feature hinders the application of many extant models, as they require regularly spaced input. Fortunately, there are heuristic approaches which might be extended to serve our goal. The duration model introduced by Engle and Russell (1998) or the conditional intensity model proposed by, e.g., Russell (1999), may be taken as a starting point, since both are primarily directed toward irregularly spaced high-frequency data. It is not difficult to specify parametric models which capture some well-documented characteristics of financial data, such as volatility clustering and intra-day periodicity. Looking at firm-specific news data, we might also find some interesting diurnal patterns (for instance, conspicuous intra-day periodicity in news release times exists for some large US firms), and parametric models can mimic these patterns as well. Estimation and inference methods for these models are also partially available. With these models at hand, we can carry out some interesting studies of financial and news data in a high-frequency setting. For instance, one might probe how the volatility process and the news-release process vary with each other, which could shed light on trading and portfolio management.
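To make the duration idea concrete, here is a minimal sketch of the conditional-duration recursion of an ACD(1,1) model in the spirit of Engle and Russell (1998), psi_i = omega + alpha * x_{i-1} + beta * psi_{i-1}, where x_i is the time elapsed between events i and i+1 and psi_i is its conditional expectation. The function names, the toy timestamps, the parameter values, and the choice to initialise psi at the sample mean are all assumptions made for illustration; in practice the parameters would be estimated (e.g., by maximum likelihood), which this sketch does not do.

```python
from typing import List, Sequence


def durations(timestamps: Sequence[float]) -> List[float]:
    """Time elapsed between consecutive events.

    Assumes timestamps are sorted and expressed in a common unit (e.g., seconds).
    """
    return [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]


def acd_expected_durations(
    x: Sequence[float], omega: float, alpha: float, beta: float
) -> List[float]:
    """Conditional expected durations from the ACD(1,1) recursion.

    psi[i] = omega + alpha * x[i-1] + beta * psi[i-1]
    psi[0] is set to the sample mean of x (an illustrative convention).
    """
    if not x:
        return []
    psi = [sum(x) / len(x)]
    for prev_duration in x[:-1]:
        psi.append(omega + alpha * prev_duration + beta * psi[-1])
    return psi


# Hypothetical release times (in seconds) of news items for one firm.
news_times = [0.0, 1.0, 3.0, 3.5, 7.5]
x = durations(news_times)

# Illustrative parameter values only; they are not estimates from any data.
psi = acd_expected_durations(x, omega=0.1, alpha=0.2, beta=0.7)

for observed, expected in zip(x, psi):
    print(f"observed duration {observed:.2f}, expected duration {expected:.4f}")
```

The same recursion can be applied to trade or quote timestamps, so news arrivals and market events can be described on their own irregular time grids rather than being forced onto a common fixed window.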

Certainly there are many other obstacles to overcome. News data are typically very “noisy”, in the sense that patterns which are informative for financial applications are hard to identify. Many news items are also endogenous; that is, the information they contain has already been assimilated by the market before their release. To summarize, the task is two-fold: the development of econometric models that accommodate high-frequency financial and news data, and the design of effective rules for filtering and transforming news data. Once accomplished, the result will benefit everyone who seeks to better understand the functioning of the market.


Ye Zeng is based at Aarhus University (2016-2019), and his research project is Modelling and Forecasting the Joint Distribution of Asset Returns with News (WP3).