6

I am trying to implement a simple RNN to predict the next integer in an integer sequence. So, I have a data set that is as below:

Id  Sequence
1   1,0,0,2,24,552,21280,103760,70299264,5792853248,587159944704
2   1,1,5,11,35,93,269,747,2115,5933,16717,47003,132291,372157,1047181,2946251,8289731,23323853,65624397,184640891,519507267,1461688413,4112616845,11571284395,32557042499,91602704493,257733967693
4   0,1,101,2,15,102,73,3,40,16,47,103,51,74,116,4,57,41,125,17,12,48,9,104,30,52,141,75,107,117,69,5,148,58,88,42,33,126,152,18,160,13,38,49,55,10,28,105,146,31,158
5   1,4,14,23,42,33,35,34,63,66,87,116,84,101,126,164,128,102,135,143,149,155,203,224,186,204,210,237,261,218,219,286,257,266,361,355,336,302,374,339,371,398,340,409,348,388,494,436,407,406
6   1,1,2,5,4,2,6,13,11,4,10,10,12,6,8,29,16,11,18,20,12,10,22,26,29,12,38,30,28,8,30,61,20,16,24,55,36,18,24,52,40,12,42,50,44,22,46,58,55,29,32,60,52,38,40,78,36,28,58,40,60,30,66,125,48,20,66,80,44,24
9   0,31,59,90,120,151,181,212,243,273,304,334,365,396,424,455,485,516,546,577,608,638,669,699,730,761,789,820,850,881,911,942,973,1003,1034,1064,1095,1126,1155,1186,1216,1247,1277,1308,1339,1369,1400,1430
10  1,1,2,5,13,36,111,347,1134,3832,13126,46281,165283,598401,2202404,8168642,30653724,116082962,442503542,1701654889,6580937039,25603715395,100223117080,394001755683,1556876401398,6178202068457,24608353860698,98421159688268,394901524823138,1589722790850089
12  0,0,0,0,112,40286,5485032,534844548,45066853496,3538771308282,267882021563464,19861835713621616,1453175611052688600,105278656040052332838,7564280930105061931496

My code so far is:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.preprocessing.sequence import pad_sequences

def stoarray(data = [], sep = ','):
    return data.map(lambda x: np.array(x.split(sep), dtype=float))

def create_dataset(dataset, window_size=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-window_size-1):
        a = dataset[i:(i+window_size), 0]
        dataX.append(a)
        dataY.append(dataset[i + window_size, 0]) #gives the ValueError : Can only tuple index with multi index
    return np.array(dataX), np.array(dataY)

# fix random seed for reproducibility
np.random.seed(7)

# loading data
colna = ['id', 'seq']
train_data = pd.read_csv('G:/Python/integer_sequencing/testfile.csv', header=1)
train_data.columns = colna
dataset = train_data['seq']
#print(dataset)
window_size = 1
X_train, Y_train = create_dataset(dataset, window_size)

print(X_train.head(5))
print(Y_train.head(5))

I am trying to split each sequence with X_train as input that consists the complete set except the last term and Y_train be treated as an output will consist only the last digit. But when I run the code, I get the ValueError : Can only tuple-index with a MultiIndex. Can anybody explain what it means with respect to my code and what must I do to resolve it.

The traceback call: enter image description here

PS - I am new to stack overflow and deep learning, so I would be very grateful if you could suggest and help me with formatting of my question.

Nimy Alex
  • 155
  • 1
  • 3
  • 13
  • always put full error message (Traceback) in question - there are other useful inforamtion, ie. it shows which line makes problem. – furas Nov 25 '17 at 20:52
  • In `dataset[i:(i+window_size), 0]` you use tuple `i:(i+window_size), 0` to get data but it doesn't allow for this. You can use only single integer like `dataset[0]` or slice `dataset[i:(i+window_size)]`. In your code `dataset` means `train_data['seq']` which is single column from DataFrame - and it is `Series`, not `DataFrame` with one column. Use `print(type(dataset))` to see it. – furas Nov 25 '17 at 21:10
  • I have added the error message in the question. Hope it's helpful. – Nimy Alex Nov 25 '17 at 21:22
  • As I said you have problem with `dataset[i:(i+window_size), 0]` - it is single column - single-dimensional `Series` - but you use `i:(i+window_size), 0` like in two-dimensional `DataFrame` – furas Nov 25 '17 at 21:30
  • BTW: next time put Traceback as text - it is prefered format for errors and code because everybody can copy text (or its part) to use it in answer. – furas Nov 25 '17 at 21:35
  • That worked! thanks! I will keep that point in mind. – Nimy Alex Nov 25 '17 at 21:37
  • How do I mark your suggestion as the answer?? – Nimy Alex Nov 25 '17 at 21:38
  • I put my comments as answer and then you can mark it as accepted. – furas Nov 25 '17 at 21:39

1 Answers1

7

You have problem with i:(i+window_size), 0 in dataset[i:(i+window_size), 0].

In your code dataset means train_data['seq'] which is single column - single-dimensional Series - but you use i:(i+window_size), 0 like in two-dimensional DataFrame.

You can use only single integer like dataset[0] or slice dataset[i:(i+window_size)]

furas
  • 134,197
  • 12
  • 106
  • 148