I am trying to build a neural network for a sequence-to-class use case. I have a dataframe with 7 columns (plus the index):
index ID timestamp x1 x2 x3 date_maturity_encoded target_maturity
79 96273 2015-01-08 [] [] project1 29 06
80 96273 2015-01-08 [] [] project1 29 06
81 96273 2015-01-08 [] [] project1 29 06
82 96273 2015-01-19 [] [] project1 29 06
83 96273 2015-06-15 [] [] project1 39 06
84 96273 2016-02-28 [] [] project2 57 06
85 96274 2015-01-08 [] [] project2 29 08
86 96274 2015-01-08 [] [] project2 29 08
87 96274 2015-01-08 [] [] project2 29 08
88 96274 2015-02-26 [] [] project2 29 08
89 96274 2015-03-02 prg46 X1.80 [] project2 29 08
90 96274 2015-03-27 [] [] project2 35 08
91 96274 2015-04-09 [] [] project2 35 08
92 96274 2015-04-21 prg46 X1.80 [] project2 37 08
93 96274 2015-06-09 [] [] project2 39 08
94 96274 2015-06-23 [] [] project2 40 08
95 96274 2015-08-03 CW_38/15 [] project2 40 08
96 96274 2015-09-09 [] [] project2 52 08
97 96274 2015-09-21 [] [] project2 29 08
98 96274 2015-10-09 [] [] project2 29 08
99 96274 2016-03-01 CW_38/15 [] project2 57 08
- The first 6 columns are going to be the input and the 7th column is the output.
- ID and x3 are attributes the dataset needs to be grouped and aggregated by.
- There is always one x3 per ID. An ID can have i rows.
- Columns x1 and x2 are filled with strings. The timestamp column contains dates.
- target_maturity is the target value which needs to be predicted.
First of all, I encode x3 and the target value with LabelEncoder:
from sklearn import preprocessing
### ENCODE PROJECTS WITH LABEL ENCODER
le = preprocessing.LabelEncoder()
le.fit(df.x3.unique())
df["x3_encoded"] = le.transform(df["x3"])
### ENCODE OUTPUT DATA
le.fit(df.target_maturity.unique())
df["target_maturity_encoded"] = le.transform(df["target_maturity"])
target = df.drop_duplicates(subset='ID', keep='first') # keep the first occurrence of the target value per ID
target = target['target_maturity_encoded']
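A minimal NumPy-only sketch of what the label encoding above does, assuming toy values for x3 and target_maturity (the sample lists and the helper name encode_labels are hypothetical). It keeps one mapping per column, so the x3 classes are still available after the target has been encoded; refitting a single LabelEncoder discards the earlier mapping.

```python
import numpy as np

def encode_labels(values):
    """Map each distinct value to an integer index (sorted order), like a label encoder."""
    classes, codes = np.unique(np.asarray(values), return_inverse=True)
    return classes, codes

# Toy data standing in for df["x3"] and df["target_maturity"] (assumed values).
x3 = ["project1", "project1", "project2", "project2"]
target_maturity = ["06", "06", "08", "08"]

x3_classes, x3_codes = encode_labels(x3)
target_classes, target_codes = encode_labels(target_maturity)

print(x3_codes.tolist())
print(target_codes.tolist())
print(target_classes[target_codes].tolist())  # decode back to the original labels
```

Indexing the classes array with the codes recovers the original labels, which is the same round trip that inverse_transform provides on a fitted encoder.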
Next I will manipulate the strings in x1/x2 to numeric sequences:
from keras.preprocessing.text import Tokenizer
tok = Tokenizer(char_level=True)
df['x1'] = [str(i) for i in df['x1']]
tok.fit_on_texts(df['x1'])
df['x1'] = tok.texts_to_sequences(df['x1'])
df['x2'] = [str(i) for i in df['x2']]
tok.fit_on_texts(df['x2'])
df['x2'] = tok.texts_to_sequences(df['x2'])
index ID timestamp x1 x2 x3_encoded date_maturity_encoded target_maturity_encoded
79 96273 2015-01-08 [1, 2] [2, 1] 1 29 3
80 96273 2015-01-08 [1, 2] [2, 1] 1 29 3
81 96273 2015-01-08 [1, 2] [2, 1] 1 29 3
82 96273 2015-01-19 [1, 2] [2, 1] 1 29 3
83 96273 2015-06-15 [1, 2] [2, 1] 1 39 3
84 96273 2016-02-28 [1, 2] [2, 1] 1 57 3
85 96274 2015-01-08 [1, 2] [2, 1] 2 29 5
86 96274 2015-01-08 [1, 2] [2, 1] 2 29 5
87 96274 2015-01-08 [1, 2] [2, 1] 2 29 5
88 96274 2015-02-26 [1, 2] [2, 1] 2 29 5
89 96274 2015-03-02 [3, 3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4] [2, 1] 2 29 5
90 96274 2015-03-27 [1, 2] [2, 1] 2 35 5
91 96274 2015-04-09 [1, 2] [2, 1] 2 35 5
92 96274 2015-04-21 [3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4] [2, 1] 2 37 5
93 96274 2015-06-09 [1, 2] [2, 1] 2 39 5
94 96274 2015-06-23 [1, 2] [2, 1] 2 40 5
95 96274 2015-08-03 [3, 3, 42, 13, 7, 15, 16, 39, 5, 22] [2, 1] 2 40 5
96 96274 2015-09-09 [1, 2] [2, 1] 2 52 5
97 96274 2015-09-21 [1, 2] [2, 1] 2 29 5
98 96274 2015-10-09 [1, 2] [2, 1] 2 29 5
99 96274 2016-03-01 [42, 13, 7, 15, 16, 39, 5, 22] [2, 1] 2 57 5
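Note that in the output above "[]" is encoded as [1, 2] in x1 but as [2, 1] in x2, so the two columns do not share one consistent character index. One way to get a consistent index is to build the vocabulary once over both columns before encoding either. A pure-Python sketch, assuming toy strings; build_char_index and encode are hypothetical helpers, not Keras functions:

```python
from collections import Counter

def build_char_index(texts):
    """Assign indices 1..n to characters, most frequent first; 0 stays free for padding."""
    counts = Counter(ch for text in texts for ch in text)
    # Sort by descending count, then by character, so ties are deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode(text, char_index):
    """Convert a string to a list of character indices."""
    return [char_index[ch] for ch in text]

x1 = ["[]", "prg46", "[]"]   # assumed sample values
x2 = ["[]", "[]", "X1.80"]   # assumed sample values

# Build the vocabulary once, over both columns, then encode each column.
char_index = build_char_index(x1 + x2)
x1_seq = [encode(t, char_index) for t in x1]
x2_seq = [encode(t, char_index) for t in x2]

print(x1_seq[0], x2_seq[0])
```

With a single shared index, the same string maps to the same sequence in both columns, which matters if both columns later feed the same embedding layer.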
Since I am trying to predict one target value per ID, and since one project is the same for one ID, I group my data as follows:
df = df[['ID', 'x3_encoded', 'timestamp', 'x1', 'x2', 'date_maturity_encoded']] # changing order and filtering out output data
data = df.groupby(['ID','x3_encoded']).agg(lambda x: x.tolist()) # aggregating dataframe as dataframe of lists
ID x3_encoded timestamp x1 x2 date_maturity_encoded
96273 1 [2015-01-08, 2015-01-08, 2015-01-08, 2015-01-1... [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]] [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1]] [29, 29, 29, 29, 39, 57]
96274 2 [2015-01-08, 2015-01-08, 2015-01-08, 2015-02-2... [[1, 2], [1, 2], [1, 2], [1, 2], [3, 3, 24, 18... [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1... [29, 29, 29, 29, 29, 35, 35, 37, 39, 40, 40, 5...
Defining number of output classes:
### ENCODE list_maturities
num_classes = len(np.unique(df[['vr_maturity', 'date_maturity']].values)) # (0-127) 128 classes in total
One hot encoding output:
output_data = k.utils.to_categorical(target, num_classes = num_classes)
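For reference, a NumPy-only sketch of the one-hot step, assuming integer class labels in the range 0 to num_classes - 1 (the sample values are made up). Indexing an identity matrix yields one row per sample with a single 1 in the column of its class:

```python
import numpy as np

num_classes = 6                      # assumed for the example; the question uses 128
target = np.array([3, 5, 0, 3])      # assumed encoded targets, one per ID

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[target]

print(one_hot.shape)
print(one_hot.argmax(axis=1).tolist())
```

The number of rows in the result equals the number of targets, so it has to match the number of grouped IDs used as input.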
Create an array from data as input:
data_array = data.to_numpy(dtype=object)
Train test split:
from sklearn.model_selection import train_test_split
input_shape = data_array[0].shape
x_train, x_test, y_train, y_test = train_test_split(data_array,
                                                    output_data,
                                                    test_size=0.1,
                                                    shuffle=True)
Fit Model:
from keras.models import Sequential
from keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(units=8, activation='relu', input_shape=input_shape))
model.add(Dropout(0.2))
model.add(Dense(units=16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.build(input_shape)
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=10000,
                    epochs=5,
                    verbose=1,
                    validation_split=0.1)
When I run this, I receive the error shown in the traceback below. I have also tried converting each element of the input data to an array, but I cannot even convert x_train without receiving the same error.
x_tr = np.asarray([np.asarray(row, dtype=float) for row in x_train], dtype=float)
y_tr = np.asarray([np.asarray(row, dtype=float) for row in y_train], dtype=float)
How can I fit sequences in a dataframe filled with strings to a multi-class problem? Transforming the sequences to matrices with keras messes up the dataframe. I have no idea how this can be solved at all after reading through every post with the same error when using keras.
2019-11-15 23:28:39.184411: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Traceback (most recent call last):
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-105-49dec6ee8dff>", line 28, in <module>
validation_split=0.1)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training.py", line 1039, in fit
validation_steps=validation_steps)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2655, in _call
dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
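The error can be reproduced without Keras. A small sketch, assuming two made-up rows of different length: NumPy cannot build a rectangular float array from ragged lists, and padding the rows to a common length first removes the problem.

```python
import numpy as np

rows = [[1, 2], [3, 3, 24, 18]]      # ragged: lengths 2 and 4 (made-up values)

try:
    np.asarray(rows, dtype=float)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Left-pad every row with zeros up to the longest length.
max_len = max(len(r) for r in rows)
padded = np.asarray([[0] * (max_len - len(r)) + r for r in rows], dtype=float)

print(padded.shape)
```

The grouped dataframe has the same problem at a larger scale: every cell holds a list, and the lists differ in length between IDs.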
Following @DanielMöller's advice, this is as far as I got:
Before tokenizing sequences:
### - Convert the timestamps into numbers and normalize them
df['timestamp_int'] = pd.to_datetime(df['timestamp']).astype('int64')
df['timestamp_int'].head()
max_a = df.timestamp_int.max()
min_a = df.timestamp_int.min()
min_norm = 0
max_norm = 1
df['timestamp_NORMA'] = (df.timestamp_int - min_a) * (max_norm - min_norm) / (max_a - min_a) + min_norm
df['timestamp_NORMA'].head()
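The same min-max scaling in a compact, NumPy-only form, assuming a few sample dates. It works on days rather than nanoseconds, which should give the same scaled values for whole-day timestamps, and it guards against a zero range (an addition that is not in the code above).

```python
import numpy as np

# Sample dates (assumed); datetime64[D] stores whole days.
timestamps = np.array(["2015-01-08", "2015-06-15", "2016-02-28"], dtype="datetime64[D]")
as_int = timestamps.astype("int64")              # days since 1970-01-01

span = as_int.max() - as_int.min()
# Guard against a zero span (all timestamps equal) to avoid dividing by zero.
normalized = (as_int - as_int.min()) / span if span else np.zeros(len(as_int))

print(normalized)
```

The earliest date maps to 0 and the latest to 1, so the values stay in a range that is comfortable for a neural network input.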
One-hot encoding:
df["date_maturity_one_hot"] = ""
num_classes = len(np.unique(list_maturities_encoded))
df["date_maturity_one_hot"] = k.utils.to_categorical(df["date_maturity_encoded"], num_classes=num_classes).tolist()
After tokenizing sequences:
Zero-pad x1 and x2:
df['x1_pad'] = ""
df['x1_pad'] = pad_sequences(df['x1'], maxlen=max(df.x1.apply(len))).tolist()
df['x2_pad'] = ""
df['x2_pad'] = pad_sequences(df['x2'], maxlen=max(df.x2.apply(len))).tolist()
Group by ID and x3_encoded:
agg_input_data = df.groupby(['ID', 'x3_encoded']).agg(lambda x: x.to_list()).reset_index()
Zero-pad lists of lists:
cols = ['timestamp_NORMA', 'x1_pad', 'x2_pad', 'date_maturity_one_hot']
max_len = 118 # maximum rows an ID has in df
for i, r in agg_input_data.iterrows():
    for col in cols:
        max_char = max(input_data[col].apply(len)) ### number of characters in column
        N = max_len - len(agg_input_data.loc[i, col]) ### number of padding rows needed (118 - len(list of lists in column))
        agg_input_data.at[i, col] = [[0] * max_char] * N + agg_input_data.at[i, col]
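A sketch of the same padding idea that builds a dense array directly, assuming each ID holds a list of already padded rows of equal width (toy data; pad_groups is a hypothetical helper). Like the loop above, it puts the zero rows in front:

```python
import numpy as np

def pad_groups(groups, max_len):
    """Stack per-ID lists of rows into (num_ids, max_len, width), zero rows in front."""
    width = len(groups[0][0])
    out = np.zeros((len(groups), max_len, width), dtype="float32")
    for i, rows in enumerate(groups):
        rows = rows[-max_len:]                      # truncate if an ID has too many rows
        if rows:
            out[i, max_len - len(rows):, :] = rows
    return out

# Two IDs with 2 and 3 rows; each row is a padded sequence of width 4 (made-up values).
x1_pad_by_id = [
    [[0, 0, 1, 2], [0, 0, 1, 2]],
    [[0, 0, 1, 2], [3, 24, 18, 40], [0, 0, 1, 2]],
]

x1_array = pad_groups(x1_pad_by_id, max_len=5)
print(x1_array.shape)
```

The result is a regular float array rather than a dataframe of nested lists, so it can be converted to a tensor without the "setting an array element with a sequence" problem.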
Multiple inputs treatment:
max_timestamp_NORMA_length = max(agg_input_data.timestamp_NORMA.apply(len))
max_x1_pad_length = max(agg_input_data.x1_pad.apply(len))
max_x2_pad_length = max(agg_input_data.x2_pad.apply(len))
timeStampInput = Input((max_timestamp_NORMA_length,))
x1Input = Input((max_timestamp_NORMA_length, max_x1_pad_length))
x2Input = Input((max_timestamp_NORMA_length, max_x2_pad_length))
maturityInput = Input((max_timestamp_NORMA_length,))
Embedding:
characterEmbedding = Embedding(298, 128) # max_chars & embedding_size
x1Embed = characterEmbedding(x1Input)
x2Embed = characterEmbedding(x2Input)
maturityEmbed = Embedding(127, 12)(maturityInput) # number_of_maturity_classes, embedding_size_2
In:
timeStampInput.shape
Out[57]:
TensorShape([Dimension(None), Dimension(118)])
In:
maturityEmbed.shape
Out[58]:
TensorShape([Dimension(None), Dimension(118), Dimension(12)])
Reducing length of sequences with LSTM:
timeStampEncoded = LSTM(118)(timeStampInput)
Traceback (most recent call last):
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in <module>
timeStampEncoded = LSTM(118)(timeStampInput)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\layers\recurrent.py", line 532, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 414, in __call__
self.assert_input_compatibility(inputs)
File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 311, in assert_input_compatibility
str(K.ndim(x)))
ValueError: Input 0 is incompatible layer lstm_1: expected = 3, found ndim = 2
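Keras recurrent layers expect 3D input of shape (batch, timesteps, features), while timeStampInput above is 2D: (None, 118). One possible fix, sketched here with NumPy and made-up data, is to give the timestamp sequence an explicit feature axis of size 1. Under that assumption the matching Keras input would be declared as Input((118, 1)) rather than Input((118,)).

```python
import numpy as np

num_ids, timesteps = 4, 118
# Made-up normalized timestamps, one sequence of 118 values per ID.
timestamps_2d = np.random.rand(num_ids, timesteps).astype("float32")

# Add a trailing feature axis: (batch, timesteps) -> (batch, timesteps, 1).
timestamps_3d = np.expand_dims(timestamps_2d, axis=-1)

print(timestamps_2d.ndim, timestamps_3d.shape)
```

The embedded maturity input already has three dimensions, (None, 118, 12), so only inputs without a feature axis need this treatment.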