
I have a very simple machine learning code here:

import pandas

# load dataset
dataframe = pandas.read_csv("USDJPY,5.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:59]   # first 59 columns: features
Y = dataset[:, 59]     # 60th column: 0/1 label
# fit Dense Keras model
model.fit(X, Y, validation_data=(X_test, y_test), epochs=150, batch_size=10)

My X values are 59 features, with the 60th column being my Y value, a simple 1-or-0 classification label.

Since I am using financial data, I would like to look back at the past 20 rows of X in order to predict each Y value.

So how could I make my algorithm use the past 20 rows as an input for X for each Y value?

I'm relatively new to machine learning and have spent much time looking online for a solution to my problem, yet I could not find anything as simple as my case.

Any ideas?

xion
  • So what you want is to use the 20 most recent values to predict the next unknown Y value? – DarkCygnus Aug 18 '17 at 21:02
  • +1 RNN is your best bet; I had an answer very close to djk's, but they covered the major points already. One thing to add: a link to RNNs in Keras: https://keras.io/layers/recurrent/#simplernn Also, one potential alternative is to use a running sum or another function over each consecutive 20 observations of your data set, so that when you run another model on the resulting set, those 20 observations' information is present. – AChervony Aug 18 '17 at 21:23
  • @DarkCygnus To be clear, I want to use the past 20 rows (including the current row of the y value I want to predict) to predict the current Y value. Thanks! – xion Aug 18 '17 at 22:04
  • @AChervony Thanks! Could you also post your answer? More examples really help – xion Aug 18 '17 at 22:08

2 Answers


This is typically done with Recurrent Neural Networks (RNNs), which retain some memory of the previous input when the next input is received. That's a very brief explanation of what goes on, but there are plenty of sources on the internet to help you wrap your head around how they work.

Let's break this down with a simple example. Say you have 5 samples and 5 features, and you want to stagger the data by 2 rows instead of 20. Here is your data (assuming 1 stock, with the oldest price first); we can think of each row as a day of the week:

import numpy as np

ar = np.random.randint(10, 100, (5, 5))

[[43, 79, 67, 20, 13],    #<---Monday---
 [80, 86, 78, 76, 71],    #<---Tuesday---
 [35, 23, 62, 31, 59],    #<---Wednesday---
 [67, 53, 92, 80, 15],    #<---Thursday---
 [60, 20, 10, 45, 47]]    #<---Friday---

To use an LSTM in Keras, your data needs to be 3-D, versus the 2-D structure it has now; the notation for each dimension is (samples, timesteps, features). Currently you only have (samples, features), so you need to augment the data:

# stack overlapping windows of 2 consecutive rows, then reshape to 3-D
a2 = np.concatenate([ar[x:x+2,:] for x in range(ar.shape[0]-1)])
a2 = a2.reshape(4, 2, 5)   # (samples, timesteps, features)

[[[43, 79, 67, 20, 13],    #See Monday First
  [80, 86, 78, 76, 71]],   #See Tuesday second ---> Predict Value originally set for Tuesday
 [[80, 86, 78, 76, 71],    #See Tuesday First
  [35, 23, 62, 31, 59]],   #See Wednesday Second ---> Predict Value originally set for Wednesday
 [[35, 23, 62, 31, 59],    #See Wednesday Value First
  [67, 53, 92, 80, 15]],   #See Thursday Values Second ---> Predict value originally set for Thursday
 [[67, 53, 92, 80, 15],    #And so on
  [60, 20, 10, 45, 47]]]

Notice how the data is staggered and 3-dimensional. Now just build an LSTM network. Y remains 2-D, since this is a many-to-one structure, but you need to clip off its first value:

from keras.models import Sequential
from keras.layers import LSTM, Dense

hidden_dims = 32   # number of LSTM units; this is your choice
model = Sequential()
model.add(LSTM(hidden_dims, input_shape=(a2.shape[1], a2.shape[2])))
model.add(Dense(1))

This is just a brief example to get you moving. There are many different setups that will work (including ones that don't use an RNN); you need to find the correct one for your data.
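To generalize the same windowing idea to the 20-row lookback in your question, here is a minimal sketch (the helper name `make_windows` is mine; it assumes `X` has shape `(n_samples, 59)` and `Y` has shape `(n_samples,)`):

import numpy as np

def make_windows(X, Y, timesteps):
    # Each sample is a window of `timesteps` consecutive rows, giving
    # shape (n_samples - timesteps + 1, timesteps, n_features).
    X3 = np.stack([X[i:i + timesteps] for i in range(X.shape[0] - timesteps + 1)])
    # Each window predicts the Y of its last row, so drop the first
    # (timesteps - 1) labels to keep X3 and Y aligned.
    return X3, Y[timesteps - 1:]

For example, 316 rows of 59 features with `timesteps=20` yield an X3 of shape (297, 20, 59) and labels of shape (297,).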

DJK
  • Makes sense, thank you! However, I have a few questions. When you say stagger, do you mean using the past 2 rows (including the current row) to predict the current y? Because for each y I'm predicting, I'd like to use the past 20 rows, for example. Secondly, I'm confused about what I should put into `hidden_dims`, if anything? – xion Aug 18 '17 at 22:01
  • Also, I don't understand why 4 is used in `a2 = a2.reshape(4,2,5)`. If I'm correct, the 2 stands for how many rows I'm staggering and the 5 stands for the number of features, but what does the 4 stand for? – xion Aug 18 '17 at 22:12
  • Try to follow the logic I added as comments in the code, and think about how it applies. `hidden_dims` is just the number of nodes in that layer; that's something you decide, a number > 1 (typically it is the number of input values, but it's up to you). In `reshape` we use 4 because there is one less total sample: there is no data before the first row, so there is no way to stagger it. Essentially we drop it, leaving us with one less sample. – DJK Aug 18 '17 at 22:20
  • Alright, I got it! One last question: since we are dropping that first value, wouldn't I have to do the same to my y values when I read those in, or is it fine to leave the y column as-is while I manipulate the x data? – xion Aug 18 '17 at 22:33
  • 1
    Thats ok, Ask away! Yes you need to drop the first y value in this example. But in your case you would drop 19 since you have 20 time steps – DJK Aug 18 '17 at 22:44
  • Thanks! So in your example, I'd reshape the y as "y = y.reshape(4)"? Since I'm staggering by two and thus dropping 1 value – xion Aug 18 '17 at 22:52
  • No we just slice off the first row, `y = y[1:,:]` – DJK Aug 18 '17 at 23:00
  • Alright, I ran into an error. In my case I have 316 total instances, 20 desired timesteps, and 59 x features/columns. I reshaped it as `X = np.concatenate([X[x:x+20,:] for x in range(X.shape[0]-1)])` followed by `X = X.reshape(315,20,59)`. However, I get `ValueError: cannot reshape array of size 361611 into shape (315,20,59)`. Any ideas? – xion Aug 18 '17 at 23:49
  • `np.concatenate([X[x:x+20,:].reshape(1,20,59) for x in range(X.shape[0]-19)])` — you can do it in one operation. You have more than 316 cases, also. – DJK Aug 19 '17 at 00:18
  • No problem. One extra point I did not mention is the problem of [vanishing gradients](http://harinisuresh.com/2016/10/09/lstms/), which is important to understand when working with an RNN. It happens when you have too many time steps. I won't go into detail, but it's covered very well in the link. – DJK Aug 19 '17 at 03:30
  • Very interesting point you brought up. I currently have around 65,000 samples total and I used a timestep of 100 instead of 20. On my fourth epoch, I currently have an accuracy of 92.57% on out of sample performance. It made me worry a little because most finance folk would look at those numbers in disbelief. However, I don't believe I am overfitting as the rest of my code is extremely vetted. Is a 65,000 sample to 100 timestep ratio okay in your expert opinion? – xion Aug 19 '17 at 04:57
  • Also, don't LSTMs automatically handle the problem of vanishing gradients? I'm using an LSTM, not a plain RNN. – xion Aug 19 '17 at 14:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152353/discussion-between-djk47463-and-xion). – DJK Aug 19 '17 at 21:46

This seems to be a time-series type of task.
I would start by looking at Recurrent Neural Networks in Keras.

If you want to keep the modelling you have (which I would not recommend), then for time series you may want to transform your data set into some kind of weighted average of the last 20 observations (rows). This way, each observation in your new data set is a function of the previous 20, so that information is present for classification.

You can use something like this for each column if you want the running sum:

import numpy as np

def running_sum(x, N):
    # cumulative sum with a leading zero, so differences give window sums
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # rolling sum over each window of N consecutive values
    return cumsum[N:] - cumsum[:-N]

x = np.random.rand(200)

print(running_sum(x, 20))
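To apply this down every feature column at once, one option is np.apply_along_axis (a sketch, assuming `X` is the 2-D feature array from the question):

# rolling 20-row sum down each column; the result has 19 fewer rows
X_rolled = np.apply_along_axis(running_sum, 0, X, 20)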

Alternatively, you could pivot your current data set so that each row holds the actual numbers: add 19 × (feature count) columns and populate them with the previous observations' data (see the sketch below). Whether this is possible or practical depends on the shape of your data set.
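A rough sketch of that pivot with pandas (assuming the `dataframe` from the question; the `_lag` column suffix is my own convention):

import pandas as pd

# lag the 59 feature columns by 1..19 rows and append them, so each
# row carries its own features plus those of the previous 19 rows
features = dataframe.iloc[:, 0:59]
lagged = [features.shift(i).add_suffix("_lag%d" % i) for i in range(1, 20)]
wide = pd.concat([features] + lagged, axis=1).dropna()  # first 19 rows lack full history
labels = dataframe.iloc[19:, 59]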

This is a simple, not-too-thorough way to make sure each observation has the data that you think will make a good prediction. You need to be aware of these things:

  1. The modelling method must be OK with observations that are not absolutely independent.
  2. When you make the prediction for X[i], you have all the information from X[i-20] to X[i-1].

I'm sure there are other considerations that make this approach suboptimal, which is why I suggest using a dedicated RNN.

I am aware that djk already pointed out that RNNs are the way to go; I'm posting this after that answer was accepted, per the OP's request.

AChervony