The chaotic nature of large financial datasets calls for in-depth analysis of their properties, which range from past information and signal filtering to statistical inference and arbitrage identification. There are some general approaches one should take into consideration when dealing with big chunks of data.

To capitalize on Big Data, information has to be extracted from all types of data. Data is either structured or unstructured: roughly 30% of it is structured, and the remaining 70% is unstructured or semi-structured. One major issue that should be taken into consideration with respect to large datasets is the blind source separation problem, also known as the cocktail-party problem, as discussed by Gresham and Oransky (2008). Imagine a room crowded with *m* people speaking simultaneously and in clusters, while several microphones record their voices. The question is how to recover the individual raw signals in isolation so that further extraction of data can be accomplished. How are we, in other words, to obtain less noisy and, therefore, more accurate data?
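One standard technique for the cocktail-party problem is independent component analysis (ICA). The sketch below is a minimal FastICA-style separation in NumPy on synthetic "voices" (a sine and a square wave standing in for two speakers); the signals, the mixing matrix, and the iteration count are all illustrative assumptions, not details from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical "voices": a sine wave and a square wave stand in for speakers.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # true sources, shape (n_samples, 2)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                          # unknown mixing matrix (the "room")
X = S @ A.T                                         # what the microphones record

# Whitening: zero-mean observations with identity covariance.
X = X - X.mean(axis=0)
d, E = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ E @ np.diag(d ** -0.5) @ E.T

# FastICA with a tanh contrast function and symmetric decorrelation.
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(Z @ W.T)                            # g(Wz), one column per component
    W_new = (G.T @ Z) / len(Z) - np.diag((1 - G ** 2).mean(axis=0)) @ W
    U, _, Vt = np.linalg.svd(W_new)                 # W <- (W W^T)^(-1/2) W
    W = U @ Vt
S_hat = Z @ W.T                                     # recovered sources (up to sign/order)
```

ICA recovers the sources only up to permutation and sign, which is why any check against the true signals must compare absolute correlations.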

Several models can be employed to identify the data signal, for instance the Wiener–Kolmogorov signal extraction filter suggested by Pollock (2006). This filter minimizes the mean square error – the distance of a fitted line from the data points – but cannot deal with mixed distributions. Another common method in econometrics is the Kalman filter (KF), as discussed by Jay, Duvaut and Darolles (2013). The main advantage of this method is that it copes with missing data – the absence of information on model variables – while its main disadvantage is its dependence on the covariance matrix. A robust indicator of a successful data extraction is the existence of clusters, or “stable regions” as Tomasini and Jaekle (2009) call them. More specifically, statistics of the input parameters such as the median, mean, and standard deviation should take similar values in the in-sample (i.e. the training part of the data) and out-of-sample (i.e. the future projection of the selected data) validation.
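The Kalman filter's ability to cope with missing data can be sketched with the simplest state-space model, a local level (random walk plus noise). This is a minimal illustration, not the multi-factor setup of Jay, Duvaut and Darolles (2013); the noise variances `q` and `r` and the synthetic series are illustrative choices. Missing observations (NaN) are handled by simply skipping the update step.

```python
import numpy as np

def kalman_filter(y, q=1e-4, r=1e-2):
    """Local-level (random-walk-plus-noise) Kalman filter.

    y : observations; np.nan marks missing data (the update step is
        skipped, so the filter keeps predicting through the gap)
    q : state (process) noise variance   -- illustrative value
    r : observation noise variance       -- illustrative value
    """
    n = len(y)
    x_hat = np.empty(n)          # filtered state estimates
    x, P = 0.0, 1.0              # initial state mean and variance
    for i in range(n):
        P = P + q                # predict: random-walk state, variance grows by q
        if not np.isnan(y[i]):   # update only when a measurement exists
            K = P / (P + r)      # Kalman gain
            x = x + K * (y[i] - x)
            P = (1 - K) * P
        x_hat[i] = x
    return x_hat

# Noisy level series with a gap of missing observations.
rng = np.random.default_rng(1)
level = np.cumsum(rng.normal(0, 0.01, 500))   # latent signal
y = level + rng.normal(0, 0.1, 500)           # noisy measurements
y[200:220] = np.nan                           # simulated missing data
filtered = kalman_filter(y)
```

Because the filter carries the prediction forward through the NaN gap, it produces an estimate at every time step, whereas a naive moving average over the raw series would break down there.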

Backtesting (in-sample and out-of-sample testing) is the simulation of a trading strategy based on data signals as well as historical data, and it is essential for estimating Value-at-Risk (VaR). Modelers should take the objective of their strategy into consideration and choose a method accordingly. Backtesting methods range from the most common – the delta-normal method, which assumes that all asset returns follow a normal distribution, according to Jorion (2011); historical simulation, which applies current weights to historical asset returns, as presented by Sommacampagna (2003); and Monte Carlo simulation, which is based on random numbers – to more complex ones, such as filtered historical simulation, which scales returns according to their volatility, as stated by Barone-Adesi and Giannopoulos (1996), and the CTS-ARMA-GARCH model, which effectively describes the skewness and fat tails of a distribution, as discussed by Carchano, Kim, Sun, Rachev and Fabozzi (2015).
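The three simpler VaR methods can be sketched side by side. The return series below is synthetic (a fat-tailed Student-t draw standing in for a historical P&L record), and the 99% confidence level, degrees of freedom, and sample sizes are illustrative assumptions; the Monte Carlo step deliberately uses a crude normal fit so that the contrast with the empirical quantile is visible.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
# Hypothetical daily returns with fat tails (Student-t), standing in for real P&L data.
returns = rng.standard_t(df=5, size=2500) * 0.01

alpha = 0.99                                    # confidence level
z = NormalDist().inv_cdf(1 - alpha)             # ~ -2.326

# 1. Delta-normal: assume normality, VaR from the mean and standard deviation.
var_normal = -(returns.mean() + z * returns.std(ddof=1))

# 2. Historical simulation: empirical quantile of past returns.
var_hist = -np.quantile(returns, 1 - alpha)

# 3. Monte Carlo: resample a parametric model fitted to the data
#    (here a normal fit, a deliberately crude assumption).
sims = rng.normal(returns.mean(), returns.std(ddof=1), 100_000)
var_mc = -np.quantile(sims, 1 - alpha)

for name, v in [("delta-normal", var_normal),
                ("historical", var_hist),
                ("Monte Carlo", var_mc)]:
    print(f"{name:>12s} 99% VaR: {v:.4f}")
```

With fat-tailed data the historical-simulation VaR will typically exceed the normal-based figures at high confidence levels, which is exactly the shortcoming the filtered historical simulation and CTS-ARMA-GARCH approaches address.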

Despite limitations such as overfitting and noisy data, backtesting stands out as an essential component of Big Data analysis and certainly calls for further research.

__References__:

Gresham, Steve D. and Arlen S. Oransky (2008). The New Managed Account Solutions Handbook: How to Build Your Financial Advisory Practice Using Managed Account Solutions. 1st edition. New Jersey: Wiley.

Pollock, D.S.G. (2006). “Econometric Methods of Signal Extraction”. Computational Statistics & Data Analysis. Vol. 50. Issue 9. 2268-2292.

Jay, Emmanuelle, Patrick Duvaut and Serges Darolles (2013). Multi-Factor Models and Signal Processing Techniques: Application to Quantitative Finance. 1st edition. London and New Jersey: Wiley-ISTE.

Tomasini, Emilio and Urban Jaekle (2009). Trading Systems: A New Approach to System Development and Portfolio Optimization. Hampshire: Harriman.

Jorion, Philippe (2011). Financial Risk Manager Handbook: FRM Part I / Part II, + Test Bank. 6th edition. New Jersey: Wiley.

Sommacampagna, Cristina (2003). “Estimating Value at Risk with the Kalman Filter”. IV Workshop in Finanza Quantitativa. Available at: http://www.finanzaonline.com/forum/attachments/econometria-e-modelli-di-tradingoperativo/836347d1201361672-isteresi-unopportunita-o-una-difficolta-cs1.pdf [Accessed: 7th October 2015].

Barone-Adesi, Giovanni and Kostas Giannopoulos (1996). “A Simplified Approach to the Conditional Estimation of Value-at-Risk”. Futures and Options World. 68-72.

Carchano, Oscar, Kim Young Shin, Edward W. Sun, Svetlozar T. Rachev and Frank J. Fabozzi (2015). “A Quasi-Maximum Likelihood Estimation Strategy for Value-at-Risk Forecasting: Application to Equity Index Futures Markets”. In Handbook of Financial Econometrics and Statistics. Ed. Lee Cheng-Few, Lee John C. Springer. 1325-1340.

*Adamantios Ntakaris is based at Tampere University of Technology (2016-2019), and his research project is **Divide and Conquer Deep Learning for Big Data in Finance** (WP1).*