Why Financial Time Series LSTM Prediction fails

and when it might just work.

After the Writing like Cervantes appetizer, where an LSTM neural network ‘learnt’ to write in Spanish in under a couple of hours (an impressive result, at least for me), I applied the same technique to finance.

This is what I learnt:

View in Colaboratory (the notebook with the code)

Time Series prediction with LSTM

In Writing like Cervantes I showed how LSTM-NN (Long Short-Term Memory neural networks) display what a fellow blogger calls an “[unreasonable effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)”. If you look at the example I developed, you will see how the neural network ‘learned’ to write in Spanish (letter by letter) in a little under 2 hours after reading the whole ‘Don Quijote’.

Not surprisingly, LSTM NN have been proposed to predict Time Series.

Spoiler alert: most of them do not work in finance (so far I have only seen one that claims to work: Universal Features of Price Formation in Financial Markets: Perspectives From Deep Learning, by Sirignano and Cont). That could be because:

  1. Practitioners who have implemented a working version of a neural network would use the whole extent of the law to keep it secret (see Goldman suing an ex-programmer) and avoid the ‘alpha destruction effect’ that happens when a strategy is published (see Does Academic Research Destroy Stock Return Predictability?).
  2. The time series cannot be predicted because, as Sirignano and Cont write, “the data used for estimation is often limited to a recent time window, … financial data can be ‘non-stationary’ and prone to regime changes which may render older data less relevant for prediction.”
  3. The time series can be predicted, but the practitioner/researcher does not have access to the full dataset or to enough computing power.

Below (after some discussion of points 2 and 3) I will show in full detail an example of time series prediction of the 5 year US rate.

More about non-stationarity and regime changes: you can read the formal definition in Wikipedia (non-stationary), but let me give you a real-life example.

On 6 September 2011 the Swiss National Bank:

set a minimum exchange rate of 1.20 francs to the euro (capping franc’s appreciation) saying “the value of the franc is a threat to the economy”, and that it was “prepared to buy foreign currency in unlimited quantities”

a policy that was abandoned on 15 January 2015.

The period when the Swiss franc was capped against the euro corresponds to the shaded region in the plot below. The big moves before and after correspond to regime changes. The argument that neural networks are unable to identify regime changes usually assumes that no additional variable is available to identify them, so it can be weakened if we are able to add input variables that identify the regimes (in this case, central bank intervention).

EURCHF — SNB intervention during shadowed range
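As a minimal sketch of what "adding an input variable that identifies the regime" could look like, the snippet below attaches a 0/1 flag to each observation. The two dates come from the text above; the function names and the three sample observations are made up for illustration and are not real market data.

```python
from datetime import date

# Dates taken from the text: the floor was set on 6 September 2011
# and abandoned on 15 January 2015.
FLOOR_START = date(2011, 9, 6)
FLOOR_END = date(2015, 1, 15)


def snb_floor_flag(day):
    """Return 1 while the 1.20 floor was in force, 0 otherwise."""
    return 1 if FLOOR_START <= day < FLOOR_END else 0


def add_regime_feature(dates, rates):
    """Pair each observation with the regime flag as a second input feature."""
    return [(rate, snb_floor_flag(day)) for day, rate in zip(dates, rates)]


# Hypothetical observations, for illustration only.
dates = [date(2011, 8, 1), date(2012, 6, 1), date(2015, 2, 2)]
rates = [1.10, 1.20, 1.05]
features = add_regime_feature(dates, rates)
```

Each input then has two features instead of one, which is what lets a model treat the capped period differently from the rest.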

However, it is difficult to know in advance which variables will be needed for prediction (it is always easier in hindsight), and if there are multiple regimes the data will be subdivided over and over, reducing the possibility of successful training (mental exercise: think what would happen if we took the ‘Don Quixote’ example mentioned above and trained the system with a mixed bag of small blogs in different languages instead of a huge book in one language).

Sirignano and Cont claim they can make it work by:

  • using huge amounts of data: “Our data set is a high-frequency record of all orders, transactions and order cancellations for approximately 1000 stocks traded on the NASDAQ between January 1, 2014 and March 31, 2017.” “In electronic markets such as the NASDAQ, new orders may arrive at high frequency, sometimes every microsecond, and order books of certain stocks can update millions of times per day. This leads to TeraBytes of data, which we put to use to build a data-driven model of the price formation process.”
  • massive computer power: “Approximately 500 GPU nodes are used to train the stock-specific models.”
  • LSTM Neural Networks: “The resulting LSTM network involves up to hundreds of thousands of parameters. This is relatively small compared to networks used for instance in image or speech recognition, but it is huge compared to econometric models traditionally used in finance.”

I have a pet peeve about irreproducible research, and lacking access to terabytes of limit order data and hundreds of GPUs, I cannot comment on the paper.

But a very important factor that could explain the success of this approach is the use of the whole public view of the Limit Order Book (LOB).

Think about it: above I showed how regime changes worked in the case of the EURCHF; after the announcement, the ‘floor’ of 1.20 CHF to EUR was in effect an unlimited limit order. In fact, the limit order does not have to be unlimited to have an effect on subsequent prices, as long as the limit order is ‘large’:

Large limit orders can be “front-run” by “order matching” or “penny jumping”. For example, if a buy limit order for 100,000 shares for 1.00 is announced to the market, many traders may seek to buy for 1.01. If the market price increases after their purchases, they will get the full amount of the price increase. However, if the market price decreases, they will likely be able to sell to the limit order trader, for only a one cent loss. [wiki]

Hence, by choosing the whole public Limit Order Book as input to the time series, Sirignano and Cont are using an additional set of features that can provide useful extra information.
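The asymmetry in the quoted example can be written out as a small piece of arithmetic. This is only a sketch: prices are in cents to avoid rounding noise, the function name is hypothetical, and it assumes the large limit order is still in the book and big enough to absorb the sale.

```python
def penny_jump_pnl_cents(exit_price_cents, entry_cents=101, limit_cents=100):
    """Per-share profit or loss, in cents, for a trader who bought at 1.01
    just above a large buy limit order resting at 1.00."""
    # The resting limit order acts as a floor: the trader can always sell to it.
    effective_exit = max(exit_price_cents, limit_cents)
    return effective_exit - entry_cents


gain_if_up = penny_jump_pnl_cents(110)    # market rises to 1.10
loss_if_down = penny_jump_pnl_cents(90)   # market falls to 0.90
```

The gain grows with the price move while the loss stays at one cent, which is why the presence of a large order in the book carries information about the next prices.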

Note: Before trying to use neural networks to deal with time series it is worthwhile to ask whether we have ‘enough’ data and computer power (which is not the case for most lonely day traders and academics who publish on the web and in scientific papers).

How could it work?

Still, I find value in negative results (the dark matter of research), so for my sake I decided to try a simple Time Series example ‘warts and all’.

In a previous example of AI in TimeSeries I showed how an unsupervised learning technique (clustering) can be used to identify regime change. But if you click on the example, you will see that a time series with about 2.5k data points (250 business days per year for 10 years), divided into regimes that range from months to years, provides very little data for a big neural network like the ‘Don Quijote’ one (which has more than 1 million parameters to train).

There are quite a few sources of material on using LSTM for prediction, but I particularly liked one class on Coursera: Crude Oil price prediction.

Other useful resources are:

The following code is a copy of the coursera course, but I will try to give some insights.

The first thing I do is to use USD rates, just to try something different and check whether the results make sense (I do that because I am more familiar with rates, but also because the data is free and does not require me to publish user IDs and passwords).

Visually, I can see that values range between 0 and 10%.

Note (first wart): pre-defining the range using all the data introduces look-ahead bias, even if for rates it can be claimed that the near future will still range between these values. I have seen this error multiple times.

Below you can see the data is represented as an array of consecutive days.

Note (2nd wart): this technique removes information that could be relevant: is it a weekend, an end of month or an end of quarter (weekend effect)? We could add it back as a ‘feature’.
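A minimal sketch of how that calendar information could be rebuilt from the observation dates, using only the standard library. The function names and the choice of flags are illustrative assumptions; month end here means the calendar month end, not the last business day.

```python
import calendar
from datetime import date


def calendar_features(day):
    """Return simple calendar flags for one observation date."""
    last_day_of_month = calendar.monthrange(day.year, day.month)[1]
    is_month_end = day.day == last_day_of_month
    return {
        "weekday": day.weekday(),  # 0 = Monday ... 6 = Sunday
        "is_month_end": is_month_end,
        "is_quarter_end": is_month_end and day.month in (3, 6, 9, 12),
    }


def gaps_in_days(dates):
    """Calendar days between consecutive observations (3 across a weekend)."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


friday = date(2018, 6, 29)
monday = date(2018, 7, 2)
year_end = date(2018, 12, 31)
weekend_gap = gaps_in_days([friday, monday])
```

Any of these values could be fed to the network as an extra input next to the rate itself, at the cost of more parameters to train.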

Here we start entering some ‘hyperparameters’:

  • batch size and epochs tell the system how many samples to process per weight update and how many passes to make over the training data
  • training timesteps tell the system how far back in time each training window reaches.
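To make the role of the training timesteps concrete, here is a simplified sketch of how a series can be cut into windows. It assumes a single next-value target per window, which may differ from what the notebook does; the function name is hypothetical.

```python
def make_windows(series, timesteps):
    """Split a series into (input window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(series) - timesteps):
        # The window holds `timesteps` consecutive observations...
        inputs.append(series[start:start + timesteps])
        # ...and the target is the observation that follows the window.
        targets.append(series[start + timesteps])
    return inputs, targets


# Tiny made-up series, for illustration only.
windows, next_values = make_windows([1, 2, 3, 4, 5], timesteps=3)
```

A series of N points yields N minus timesteps examples, so longer windows leave slightly fewer examples to train on.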

Note (3rd wart): The hyperparameters will impact the accuracy of the predictor, and if we only have a small set of data it is possible that playing with different hyperparameters gives us an ‘overfitted neural network’ that works very well on the available set but awfully afterwards.

Note: I have seen the training timesteps confused with the amount of memory used (as if the system were an ARIMA model), but that is not the case. An LSTM can remember well back into the past.

Note (1st wart again): the MinMaxScaler introduces look-ahead bias. The inputs of the neural network are scaled to fall within the range 0 to 1, and just below we used the whole available data set to scale the input.
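One way to avoid this particular wart, sketched here in plain Python rather than with the MinMaxScaler itself, is to compute the minimum and maximum on the training part only and reuse them for later data. The function names and numbers are illustrative assumptions.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training data only."""
    low, high = min(train_values), max(train_values)
    if high == low:
        raise ValueError("training data is constant, cannot scale")
    return low, high


def scale(values, low, high):
    """Apply a previously fitted range; unseen data may fall outside 0 to 1."""
    return [(value - low) / (high - low) for value in values]


# Made-up series: the first three points are 'the past', the rest 'the future'.
series = [2.0, 4.0, 6.0, 8.0, 10.0]
train, test = series[:3], series[3:]
low, high = fit_min_max(train)
scaled_train = scale(train, low, high)
scaled_test = scale(test, low, high)
```

The price of removing the look-ahead is visible in the output: future values above the training maximum are mapped above 1, so the network sees inputs outside the range it was trained on.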

What if we want to predict stock values (which can rise forever)? Well, you need to somehow transform your data into something that falls between 0 and 1. That is why Sirignano and Cont model the probability that the next move is up or down, a number between 0 and 1:

The models therefore predict whether the next price move is up or down
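A minimal sketch of how an unbounded price series can be turned into up/down labels of this kind. It is not the procedure from the paper: the function name is hypothetical, and skipping unchanged prices is my own simplifying assumption.

```python
def direction_labels(prices):
    """Label each price change as 1 (up) or 0 (down)."""
    labels = []
    for previous, current in zip(prices, prices[1:]):
        if current == previous:
            continue  # assumption: ignore steps where the price did not move
        labels.append(1 if current > previous else 0)
    return labels


# Made-up prices, for illustration only.
labels = direction_labels([100, 101, 101, 99, 105])
```

Whatever level the price reaches, the targets stay in the range 0 to 1, and a network trained on them can output a probability of an up move.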

Now we set up the architecture of the LSTM neural network: 4 layers (an input layer, 2 LSTM layers with 10 neurons each and a final Dense layer).

Note (3rd wart again): we have more hyperparameters here (layers and neurons). How do we define them? We can modify them until a network has better accuracy, but then we fall into overfitting. That is why we need more and more data: we need a different testing set for each architecture to avoid overfitted models.

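Note: here (as opposed to the ‘Don Quijote’ example) the LSTM layers are given the stateful=True argument. That means the state reached at the end of one batch is carried over as the starting state of the next batch, so the network can keep memory beyond a single training window.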
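A practical consequence, as far as I am aware, is that a stateful Keras model needs a fixed batch size and a number of samples that is a multiple of it. A minimal sketch of the trimming step follows; the helper name is hypothetical, and dropping the oldest samples rather than the newest is my own choice.

```python
def trim_to_batch_multiple(samples, batch_size):
    """Drop the oldest samples so the count is a multiple of batch_size."""
    usable = (len(samples) // batch_size) * batch_size
    # Keep the most recent `usable` samples (samples are assumed time-ordered).
    return samples[len(samples) - usable:]


# 150 made-up samples with a batch size of 64 leave 128 usable samples.
trimmed = trim_to_batch_multiple(list(range(150)), batch_size=64)
```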


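# Initialising the LSTM model with MAE loss function, using the Functional API.
# batch_size and timesteps are the hyperparameters defined earlier in the notebook.
# The import paths below are assumed; adjust them to your Keras installation.
from keras.layers import Input, LSTM, Dense
from keras.models import Model

inputs_1_mae = Input(batch_shape=(batch_size, timesteps, 1))
# Two stacked stateful LSTM layers with 10 units each
lstm_1_mae = LSTM(10, stateful=True, return_sequences=True)(inputs_1_mae)
lstm_2_mae = LSTM(10, stateful=True, return_sequences=True)(lstm_1_mae)
# Dense layer maps the 10 LSTM outputs at each timestep to one predicted value
output_1_mae = Dense(units=1)(lstm_2_mae)

regressor_mae = Model(inputs=inputs_1_mae, outputs=output_1_mae)
regressor_mae.compile(optimizer='adam', loss='mae')
regressor_mae.summary()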
Layer (type)            Output Shape     Param #
input_1 (InputLayer)    (64, 30, 1)      0
lstm_1 (LSTM)           (64, 30, 10)     480
lstm_2 (LSTM)           (64, 30, 10)     840
dense_1 (Dense)         (64, 30, 1)      11
Total params: 1,331
Trainable params: 1,331
Non-trainable params: 0

It is useful to compare the number of parameters to train (1.3k) with the number of data points (6.6k). Compared to the Don Quijote example, where we had roughly the same number of parameters and data points, this is a ‘good’ sign.

Note (2nd wart): the input layer uses ‘batches’ of 64 points (to make training faster), 30 backward-looking training steps and 1 feature. Above we mentioned that we had discarded information. Here we could add it back in the form of another feature, but that would increase the number of parameters to train while our data has not increased.
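The parameter counts in the model summary can be checked by hand with the usual formulas for LSTM and Dense layers. The sketch below reproduces the 480, 840 and 11 from the summary and then shows how the first layer grows when more input features are added; the helper names are mine.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input plus one bias, for each output unit."""
    return input_dim * units + units


first_lstm = lstm_params(input_dim=1, units=10)    # fed by 1 feature
second_lstm = lstm_params(input_dim=10, units=10)  # fed by the first LSTM
final_dense = dense_params(input_dim=10, units=1)
total = first_lstm + second_lstm + final_dense

# Size of the first LSTM layer for 1, 5 and 20 input features.
growth = {n: lstm_params(input_dim=n, units=10) for n in (1, 5, 20)}
```

Going from 1 to 20 features takes the first layer from 480 to 1,240 parameters, while the number of training examples stays exactly the same.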

Neural networks seem to have escaped the curse of dimensionality (the curse says that “the number of training instances needed grows exponentially with the number of dimensions”), but that still does not mean you can use a few examples to train a huge neural network.

Below I start the training of the network; please keep scrolling down to see the results.

Results — finally

Well, they are pretty poor.

The graph above does not look too bad because the prediction at least falls ‘close by’ the last seen level. However, the prediction is done only for 1 step: the series is constructed by adding the correct value to the series at each point once it is known, ready for the next day's prediction, and even then the prediction has a downward bias which would have cost any trader dearly.
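A simple way to put numbers on "close by the last seen level" is to compare the model with a naive forecast that just repeats the previous value, and to measure the bias as the mean signed error. The sketch below uses made-up values in basis points, not the results of this notebook, and hypothetical function names.

```python
def mean_error(actual, predicted):
    """Mean signed error; a negative value means the forecast undershoots."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average size of the error, ignoring its sign."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up rate levels in basis points, for illustration only.
observed = [250, 252, 251, 255]
targets = observed[1:]           # the values to be predicted
naive_forecast = observed[:-1]   # 'tomorrow equals today'
model_forecast = [247, 246, 250] # hypothetical model output

model_bias = mean_error(targets, model_forecast)
model_mae = mean_absolute_error(targets, model_forecast)
naive_mae = mean_absolute_error(targets, naive_forecast)
```

If a one-step forecast cannot beat the naive one on this kind of comparison, the nice-looking overlay of the two lines in a plot says very little.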

Note: the test results have a big bias (a 5bps undershoot according to the histogram). This is the part where someone would be tempted either to fix the bias by making some changes to the hyperparameters (add neurons? add lags? etc.) or to train different models in parallel until one performs better in this final step. That is a mistake: this ‘out of sample’ set is supposed to be the ‘proof of the pudding’, and you cannot reuse it to have another go at the series. That is why you need lots more data, so that you can keep an untouched test set on hand for every model you want to try.
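One common way to protect the final test set, sketched here under my own assumptions about names and sizes, is to split the series chronologically into three parts: hyperparameters are tuned against the validation part, and the test part is looked at only once.

```python
def chronological_split(series, n_validation, n_test):
    """Split a time-ordered series into train, validation and test parts."""
    n_train = len(series) - n_validation - n_test
    if n_train <= 0:
        raise ValueError("not enough data for the requested split")
    train = series[:n_train]
    validation = series[n_train:n_train + n_validation]
    test = series[n_train + n_validation:]  # most recent data, touched once
    return train, validation, test


# Ten made-up observations, oldest first.
train, validation, test = chronological_split(list(range(10)), 2, 2)
```

No shuffling is involved, so every validation and test point lies strictly after the data used for training.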

What did I learn?

My goal was to show how these LSTM could be used.

Each one of the ‘warts’ can be addressed and improved when more data (and computer power) is available:

  • wart 1: look ahead bias — this requires having a good domain knowledge.
  • wart 2: removing useful information — this requires both the domain knowledge + more data
  • wart 3: hyperparameter selection — need even more data to be able to test each different model.

In three words: data, data and data.

Note: it is common to think that a lack of data can be overcome by adding more data points per example. This time series has only one data point per day, but maybe we could add lots of other related data points for the same day? If we were to do that, we would be adding input points to the neural network and increasing the number of trainable parameters, but we would still have the same number of examples. Put another way: suppose you have 10 pictures to train a cat identifier. The system will not improve if you increase the resolution of your pictures from low-res to HD; you need more pictures.
