
I am trying to build a neural network for a sequence-to-class use case. I have a dataframe with 7 columns:

index    ID    timestamp                     x1                   x2                 x3           date_maturity_encoded    target_maturity

79      96273  2015-01-08                    []                   []                project1                 29          06
80      96273  2015-01-08                    []                   []                project1                 29          06
81      96273  2015-01-08                    []                   []                project1                 29          06
82      96273  2015-01-19                    []                   []                project1                 29          06
83      96273  2015-06-15                    []                   []                project1                 39          06
84      96273  2016-02-28                    []                   []                project2                 57          06
85      96274  2015-01-08                    []                   []                project2                 29          08
86      96274  2015-01-08                    []                   []                project2                 29          08
87      96274  2015-01-08                    []                   []                project2                 29          08
88      96274  2015-02-26                    []                   []                project2                 29          08
89      96274  2015-03-02           prg46 X1.80                   []                project2                 29          08
90      96274  2015-03-27                    []                   []                project2                 35          08
91      96274  2015-04-09                    []                   []                project2                 35          08
92      96274  2015-04-21           prg46 X1.80                   []                project2                 37          08
93      96274  2015-06-09                    []                   []                project2                 39          08
94      96274  2015-06-23                    []                   []                project2                 40          08
95      96274  2015-08-03              CW_38/15                   []                project2                 40          08
96      96274  2015-09-09                    []                   []                project2                 52          08
97      96274  2015-09-21                    []                   []                project2                 29          08
98      96274  2015-10-09                    []                   []                project2                 29          08
99      96274  2016-03-01              CW_38/15                   []                project2                 57          08
  • The first 6 columns are going to be the input and the 7th column is the output.
  • ID and x3 are attributes the dataset needs to be grouped and aggregated by.
  • There is always one x3 per ID. An ID can have i rows.
  • Columns x1 and x2 are filled with strings. The timestamp column contains dates.

target_maturity is the target value which needs to be predicted.

First of all I encode x3 and the target value with LabelEncoder:

### ENCODE PROJECTS WITH LABEL ENCODER
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.x3.unique())
df["x3_encoded"] = le.transform(df["x3"])


### ENCODE OUTPUT DATA
le.fit(df.target_maturity.unique())  # the same encoder, re-fitted on the target column
df["target_maturity_encoded"] = le.transform(df["target_maturity"])
target = df.drop_duplicates(subset='ID', keep='first')  # keep the first occurrence of the target value per ID
target = target['target_maturity_encoded']

Next I convert the strings in x1/x2 into numeric sequences:

from keras.preprocessing.text import Tokenizer

tok = Tokenizer(char_level=True)
df['x1'] = [str(i) for i in df['x1']]
tok.fit_on_texts(df['x1'])
df['x1'] = tok.texts_to_sequences(df['x1'])


df['x2'] = [str(i) for i in df['x2']]
tok.fit_on_texts(df['x2'])
df['x2'] = tok.texts_to_sequences(df['x2'])
index    ID    timestamp                        x1                                        x2                 x3_encoded  date_maturity_encoded    target_maturity_encoded

79      96273  2015-01-08                                           [1, 2]               [2, 1]                   1                     29          3
80      96273  2015-01-08                                           [1, 2]               [2, 1]                   1                     29          3
81      96273  2015-01-08                                           [1, 2]               [2, 1]                   1                     29          3
82      96273  2015-01-19                                           [1, 2]               [2, 1]                   1                     29          3
83      96273  2015-06-15                                           [1, 2]               [2, 1]                   1                     39          3
84      96273  2016-02-28                                           [1, 2]               [2, 1]                   1                     57          3
85      96274  2015-01-08                                           [1, 2]               [2, 1]                   2                     29          5
86      96274  2015-01-08                                           [1, 2]               [2, 1]                   2                     29          5
87      96274  2015-01-08                                           [1, 2]               [2, 1]                   2                     29          5
88      96274  2015-02-26                                           [1, 2]               [2, 1]                   2                     29          5
89      96274  2015-03-02  [3, 3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4]               [2, 1]                   2                     29          5
90      96274  2015-03-27                                           [1, 2]               [2, 1]                   2                     35          5
91      96274  2015-04-09                                           [1, 2]               [2, 1]                   2                     35          5
92      96274  2015-04-21     [3, 24, 18, 40, 23, 21, 3, 25, 5, 14, 16, 4]               [2, 1]                   2                     37          5
93      96274  2015-06-09                                           [1, 2]               [2, 1]                   2                     39          5
94      96274  2015-06-23                                           [1, 2]               [2, 1]                   2                     40          5
95      96274  2015-08-03             [3, 3, 42, 13, 7, 15, 16, 39, 5, 22]               [2, 1]                   2                     40          5
96      96274  2015-09-09                                           [1, 2]               [2, 1]                   2                     52          5
97      96274  2015-09-21                                           [1, 2]               [2, 1]                   2                     29          5
98      96274  2015-10-09                                           [1, 2]               [2, 1]                   2                     29          5
99      96274  2016-03-01                   [42, 13, 7, 15, 16, 39, 5, 22]               [2, 1]                   2                     57          5

Since I am trying to predict one target value per ID, and since there is exactly one project per ID, I group my data as follows:

df = df[['ID', 'x3_encoded', 'timestamp', 'x1', 'x2',  'date_maturity_encoded']] # changing order and filtering out output data
data = df.groupby(['ID','x3_encoded']).agg(lambda x: x.tolist()) # aggregating dataframe as dataframe of lists
ID      x3_encoded       timestamp                                              x1                                          x2                                                        date_maturity_encoded
96273    1    [2015-01-08, 2015-01-08, 2015-01-08, 2015-01-1...    [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]]   [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1]]   [29, 29, 29, 29, 39, 57]  
96274    2     [2015-01-08, 2015-01-08, 2015-01-08, 2015-02-2...   [[1, 2], [1, 2], [1, 2], [1, 2], [3, 3, 24, 18...  [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1], [2, 1...  [29, 29, 29, 29, 29, 35, 35, 37, 39, 40, 40, 5...

Defining the number of output classes:

### ENCODE list_maturities
import numpy as np

num_classes = len(np.unique(df[['date_maturity_encoded', 'target_maturity_encoded']].values))  # 128 classes in total (0-127)

One-hot encoding the output:

import keras as k

output_data = k.utils.to_categorical(target, num_classes=num_classes)

Create an array from data as input:

data_array = data.to_numpy(dtype=object) 

Train test split:

from sklearn.model_selection import train_test_split

input_shape = data_array[0].shape
x_train, x_test, y_train, y_test = train_test_split(data_array,
                                                    output_data,
                                                    test_size=0.1,
                                                    shuffle=True)

Fit Model:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units=8, activation='relu', input_shape=input_shape))  # input_shape builds the model
model.add(Dropout(0.2))
model.add(Dense(units=16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=10000,
                    epochs=5,
                    verbose=1,
                    validation_split=0.1)

After all is said and done, I am receiving the error below. I have also tried converting each element of the input data to an array, but I cannot even manipulate x_train without receiving the error.

x_tr = np.asarray([np.asarray(row, dtype=float) for row in x_train], dtype=float)
y_tr = np.asarray([np.asarray(row, dtype=float) for row in y_train], dtype=float)

How can I fit sequences in a dataframe filled with strings to a multi-class problem? Transforming the sequences to matrices with Keras messes up the dataframe. After reading through every post with the same error when using Keras, I have no idea how this can be solved.

2019-11-15 23:28:39.184411: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Traceback (most recent call last):
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-105-49dec6ee8dff>", line 28, in <module>
    validation_split=0.1)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2655, in _call
    dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Following @DanielMöller's advice, this is as far as I have come:

Before tokenizing sequences:

### - Convert the timestamps into numbers and normalize them
import pandas as pd

df['timestamp_int'] = pd.to_datetime(df['timestamp']).astype('int64')
df['timestamp_int'].head()
max_a = df.timestamp_int.max()
min_a = df.timestamp_int.min()
min_norm = 0
max_norm = 1
df['timestamp_NORMA'] = (df.timestamp_int - min_a) * (max_norm - min_norm) / (max_a - min_a) + min_norm
df['timestamp_NORMA'].head()

One-Hot Encoding:

df["date_maturity_one_hot"] = ""
num_classes = len(np.unique(list_maturities_encoded))
df["date_maturity_one_hot"] =
k.utils.to_categorical(df["date_maturity_encoded"], num_classes=num_classes).tolist()

After tokenizing sequences:

Zero_pad x1 and x2:

from keras.preprocessing.sequence import pad_sequences

df['x1_pad'] = ""
df['x1_pad'] = pad_sequences(df['x1'], maxlen=max(df.x1.apply(len))).tolist()

df['x2_pad'] = ""
df['x2_pad'] = pad_sequences(df['x2'], maxlen=max(df.x2.apply(len))).tolist()

Group by ID and x3_encoded:

agg_input_data = df.groupby(['ID', 'x3_encoded']).agg(lambda x: x.tolist()).reset_index()

Zero_pad lists of lists:

cols = ['timestamp_NORMA', 'x1_pad', 'x2_pad', 'date_maturity_one_hot']
max_len = 118  # maximum number of rows an ID has in df

for i, r in agg_input_data.iterrows():
    for col in cols:
        max_char = max(agg_input_data[col].apply(len))  # length of the longest list in this column
        N = max_len - len(agg_input_data.loc[i, col])  # padding difference (118 - length of this ID's list of lists)
        agg_input_data.at[i, col] = [[0] * max_char] * N + agg_input_data.at[i, col]

Multiple inputs treatment:

from keras.layers import Input

max_timestamp_NORMA_length = max(agg_input_data.timestamp_NORMA.apply(len))
max_x1_pad_length = max(agg_input_data.x1_pad.apply(len))
max_x2_pad_length = max(agg_input_data.x2_pad.apply(len))

timeStampInput = Input((max_timestamp_NORMA_length,))
x1Input = Input((max_timestamp_NORMA_length, max_x1_pad_length))
x2Input = Input((max_timestamp_NORMA_length, max_x2_pad_length))
maturityInput = Input((max_timestamp_NORMA_length,))

Embedding:

from keras.layers import Embedding

characterEmbedding = Embedding(298, 128)  # max_chars & embedding_size
x1Embed = characterEmbedding(x1Input)
x2Embed = characterEmbedding(x2Input)

maturityEmbed = Embedding(127, 12)(maturityInput)  # number_of_maturity_classes, embedding_size_2

In:

timeStampInput.shape

Out[57]:

TensorShape([Dimension(None), Dimension(118)])

In:

maturityEmbed.shape

Out[58]:

TensorShape([Dimension(None), Dimension(118), Dimension(12)])

Reducing length of sequences with LSTM:

timeStampEncoded = LSTM(118)(timeStampInput)

Traceback (most recent call last):
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    timeStampEncoded = LSTM(118)(timeStampInput)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\layers\recurrent.py", line 532, in __call__
    return super(RNN, self).__call__(inputs, **kwargs)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 414, in __call__
    self.assert_input_compatibility(inputs)
  File "C:\Users\reszi\Anaconda3\envs\deeplearning\lib\site-packages\keras\engine\base_layer.py", line 311, in assert_input_compatibility
    str(K.ndim(x)))
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2

Audiogott

1 Answer

This is the case of having lists as elements of the array. A numpy array for Keras must have all values of the same type and a fixed length in every dimension.
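
A minimal, Keras-independent reproduction of that error, using a ragged list of lists (toy values, not from the question):

import numpy as np

# rows of different lengths cannot form a rectangular float array
ragged = [[1, 2], [1, 2, 3]]
np.asarray(ragged, dtype=float)
# ValueError: setting an array element with a sequence.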

The best you can do now is to separate each column into a different X array.

Now there is a lot of treatment you need to do with that data so it can enter a neural network. You should probably convert the dates into numbers, the classes into one-hot encodings and, the worst part, decide what to do with the lists of lists in x1 and x2.

What I can see is that you will need:

Before the aggregation:

  • Convert the timestamps into numbers and normalize them
  • Pad the x1 and x2 sequences with zeros so all sequences have the same length
  • Read about pad_sequences in the Keras documentation
  • Notice that you have to treat them as lists, not as a huge string

After the aggregation:

  • Pad the time stamp sequence
  • Pad the date maturity sequence
  • Pad the x1 and x2 sequences again: because they are lists of lists, you padded the inner lists already, and now you pad the outer list with numpy arrays of the same size as the inner lists (a toy sketch of both padding levels follows this list)
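
A toy sketch of both padding levels (the variable names and sizes here are illustrative, not from the question):

from keras.preprocessing.sequence import pad_sequences
import numpy as np

# inner level (before aggregation): pad each char sequence to max_chars
x1_rows = [[1, 2], [3, 3, 24, 18]]
max_chars = 4
x1_padded = pad_sequences(x1_rows, maxlen=max_chars)  # shape (2, 4)

# outer level (after aggregation): pad the per-ID stack of rows to
# max_time_length time steps, using zero rows of the inner width
max_time_length = 3
missing = max_time_length - len(x1_padded)
x1_full = np.vstack([np.zeros((missing, max_chars), dtype=int), x1_padded])
# x1_full.shape == (max_time_length, max_chars) == (3, 4)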

Finally, your model will need multiple inputs and treatment of these sequences:

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, TimeDistributed, Concatenate, Dense, Dropout

timeStampInput = Input((max_time_length,))
x1Input = Input((max_time_length, max_x1_length))
x2Input = Input((max_time_length, max_x2_length))
maturityInput = Input((max_time_length,))

You will need to pass the encoded inputs through embeddings so they have meaningful values for the model. Ideally, you would have encoded x1 and x2 together, since they are both char sequences; that way you need only one embedding instead of two.

characterEmbedding = Embedding(max_chars, embedding_size)
x1Embed = characterEmbedding(x1Input)
x2Embed = characterEmbedding(x2Input)

maturityEmbed = Embedding(number_of_maturity_classes, embedding_size_2)(maturityInput)

Now you will have to reduce the length of the sequences. LSTM layers should do this well. (You can also try Conv1D with global pooling; see the sketch below.)

For maturity, which at this point should have shape (batch, max_time_length, embedding_size_2), just a regular LSTM; the same goes for the timestamp (note that an LSTM needs 3D input, so the timestamp must carry a feature dimension, shape (batch, max_time_length, 1), as discussed in the comments below):

timeStampEncoded = LSTM(units_1)(timeStampInput)
maturityEncoded = LSTM(units_2)(maturityEmbed)
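
As an illustration of the Conv1D alternative mentioned above (the filter count and kernel size here are arbitrary):

from keras.layers import Conv1D, GlobalMaxPooling1D

# same role as the LSTM: collapse the time dimension to a fixed-size vector
maturityConv = Conv1D(32, 3, padding='same', activation='relu')(maturityEmbed)
maturityEncoded = GlobalMaxPooling1D()(maturityConv)  # shape (batch, 32)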

Now for x1 and x2, you need this in two levels, because they're sequences of sequences:

#inner dimension
x1Encoded = TimeDistributed(LSTM(units_in))(x1Embed)
x2Encoded = TimeDistributed(LSTM(units_in))(x2Embed)

#outer dimension
x1Encoded = LSTM(units_out)(x1Encoded)
x2Encoded = LSTM(units_out)(x2Encoded)

Finally you can concatenate everything:

allInputs = Concatenate()([timeStampEncoded, maturityEncoded, x1Encoded, x2Encoded])

Now you're free to go with the regular 2D model:

out = Dense(units=8, activation='relu')(allInputs)
out = Dropout(0.2)(out)
out = Dense(units=16, activation='relu')(out)
out = Dropout(0.2)(out)
out = Dense(num_classes, activation='softmax')(out)

model = Model([timeStampInput, x1Input, x2Input, maturityInput], out)

You will need to train the model with four inputs:

model.fit([timeStampArray, x1Array, x2Array, maturityArray], labels)

Notice that the shapes of the data should be something like the following (a sketch of assembling these arrays follows the list):

  • timeStampArray.shape = (data_frame_length, max_time_length)
  • x1Array.shape = (data_frame_length, max_time_length, max_x1_length)
  • x2Array.shape = (data_frame_length, max_time_length, max_x2_length)
  • maturityArray.shape = (data_frame_length, max_time_length)
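
A sketch of how these arrays could be assembled, assuming the aggregated dataframe agg_input_data from the question's update, with every cell already padded to the lengths above (the maturity column holds integer codes here, because it goes through an Embedding):

import numpy as np

timeStampArray = np.asarray(agg_input_data['timestamp_NORMA'].tolist(), dtype=float)
x1Array = np.asarray(agg_input_data['x1_pad'].tolist(), dtype=int)
x2Array = np.asarray(agg_input_data['x2_pad'].tolist(), dtype=int)
maturityArray = np.asarray(agg_input_data['date_maturity_encoded'].tolist(), dtype=int)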

I'm afraid it's not possible to give anything better than this. You must search questions about preprocessing sequences for LSTM to understand better what to do.

Daniel Möller

  • Thank you for such a detailed response. As I suspected, the issue lies within the variant shape of the input data. I have a few questions though: 1. What do you mean by converting timestamps into numbers and normalizing them? Does it mean removing the ```-``` character, converting to ```int()``` and then normalizing the value? Something like [here](https://stackoverflow.com/questions/31036148/how-to-standardize-normalize-a-date-with-pandas-numpy)? 2. Would it be possible to skip the padding step by transforming ```x1``` & ```x2``` with the ```keras.Tokenizer.sequences_to_matrix()``` function? – Audiogott Nov 19 '19 at 13:45
  • For 1, yes, that seems like it. Transform to number and normalize (I don't think you need to remove the character, most datetimes can be converted to int64 without loss; the result is usually the number of milliseconds from a certain date). – Daniel Möller Nov 19 '19 at 14:05
  • I believe sequences to matrix will only create a one-hot version of the sequences, so no. You need the padding, and you will only use one-hot if you are not using embeddings. – Daniel Möller Nov 19 '19 at 14:07
  • Hello Daniel, my next questions are about the steps after the aggregation. 1. Is padding the ```date_maturity``` necessary? It is categorical data. Shouldn't one hot encoding suffice? 2. I do not understand the step of padding ```x1``` and ```x2``` again. Columns ```x1``` and ```x2``` have their own respective length after ```pad_sequences```. Should both columns have the same length in the end? 3. I suppose ```max_chars``` is the maximum length of sequences. What about the ```embedding_size```? Thank you in advance. – Audiogott Nov 20 '19 at 18:23
  • `x1` and `x2` are already sequences before aggregation. You need to pad them to the `max_chars` size (the maximum size of the "character sequences"). – Daniel Möller Nov 20 '19 at 18:33
  • After you aggregate, you transform everything into a sequence in time. So, while every aggregated field becomes a sequence, the fields that already were sequences become "sequences of sequences". Every sequence should be padded to the max length. At the char level (the inner `x1` and `x2` sequences), they should be padded to `max_chars` length. At the time level (all the others, including the outer `x1` and `x2`), they must be padded to `max_time_length`. ---- One hot encoding does not change the length of the sequences. (All sequences must have the same length for training) – Daniel Möller Nov 20 '19 at 18:36
  • `embedding_size` is arbitrary. If you choose one-hot encoding, your data will have shape `(batch, length_in_time, length_chars, number_of_existing_chars_in_dictionary)`. But when you use an embedding your data will have shape `(batch, length_in_time, length_chars, embedding_size)`. It's a way to make your data smaller than one-hot, and make your model go faster. – Daniel Möller Nov 20 '19 at 18:39
  • I am not done yet, however I will give you the bounty, since you're the only one who helped out and your help is very valuable. Thank you, Daniel. – Audiogott Nov 24 '19 at 12:20
  • I have updated my code. However, now that I am trying to reduce the timestamp sequence with LSTM, I am receiving the following error message: ```ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2``` – Audiogott Nov 24 '19 at 13:03
  • Hmm, the shapes needed for LSTM are 3D: `(batch, length, features)`; if you're reducing the timestamp alone, then it's 1 feature only: `(batch, time_length, 1)`. Check your input shapes (a minimal sketch of this fix follows these comments). If the input is going through an Embedding first, it must be `(batch, length)` before the embedding and `(batch, length, embedding_size)` after the embedding. – Daniel Möller Nov 24 '19 at 18:10
  • You may also concatenate everything, considering you have already reduced the inner `x1` and `x2` into a single input to a single LSTM (this might add a little more intelligence to the model). – Daniel Möller Nov 24 '19 at 18:11
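
A minimal sketch of the shape fix from the comment above: give the timestamp a feature dimension so the LSTM receives 3D input (the length 118 is taken from the question's update):

from keras.layers import Input, LSTM, Reshape

timeStampInput = Input((118,))                         # (batch, 118)
timeStampReshaped = Reshape((118, 1))(timeStampInput)  # (batch, 118, 1)
timeStampEncoded = LSTM(118)(timeStampReshaped)        # ndim=3, as the LSTM expects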