
I am a beginner in RNNs and would like to build a gated recurrent unit (GRU) model for predicting a user's action on an e-commerce website, the Google Merchandise Store, which sells Google-branded merchandise.

We have 5 different actions:

  1. Add to cart

  2. Quickview click

  3. Product click

  4. Remove from cart

  5. Onsite click

My data_y, which is the target, is one-hot encoded over these actions and looks like this:

array([[0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0],
       ...,
       [0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0]], dtype=uint8)
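
For context, one-hot targets like this can be produced from a categorical action column. A minimal sketch, assuming a pandas DataFrame df with a hypothetical column action holding the five labels:

import pandas as pd

# Hypothetical: df['action'] holds one of the five action labels per row.
# get_dummies one-hot encodes it; the (sorted) column order defines the class order.
data_y = pd.get_dummies(df['action']).to_numpy(dtype='uint8')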

Using only the URL (the page path) the user has accessed, I have achieved 68% prediction accuracy, but I am still trying to improve it by adding other inputs to the model.

My data_X looks like this:

pagePath                                               

[googleredesign, bags]                                   
[googleredesign, bags]                                    
[googleredesign, electronics]                           
...
...
[googleredesign, bags, backpacks, home]                 
[googleredesign, bags, backpacks, googlealpine...     
   
53087 rows × 2 columns 

After getting the vocab length and the max sequence length, I tokenized it:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_length)
tokenizer.fit_on_texts(data_X['pagePath'])
sequences = tokenizer.texts_to_sequences(data_X['pagePath'])
word_index = tokenizer.word_index
model_inputs = pad_sequences(sequences, maxlen=max_seq_length)
data_X = model_inputs

This is what it looks like after tokenization:

array([[ 0,  0,  0,  1,  3],
       [ 0,  0,  0,  1,  3],
       [ 0,  0,  0,  1,  3],
       ...,
       [ 0,  1,  3, 12,  9],
       [ 0,  1,  3, 12,  9],
       [ 0,  1,  3, 12, 81]], dtype=int32)
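
The post does not show how vocab_length and max_seq_length were obtained. A minimal sketch of one common way to derive them before the tokenization step above (an assumption, not the original code):

# Fit a probe tokenizer on the raw pagePath column to measure the vocabulary,
# then take the longest tokenized path as the padding length.
probe = Tokenizer()
probe.fit_on_texts(data_X['pagePath'])
vocab_length = len(probe.word_index) + 1   # +1 because index 0 is reserved for padding
max_seq_length = max(len(s) for s in probe.texts_to_sequences(data_X['pagePath']))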

After that, I split the data and trained the model:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=0.3, random_state=2)
print(X_train.shape)
print(X_test.shape)  
print(y_train.shape)
print(y_test.shape)

(37160, 5)
(15927, 5)
(37160, 5)
(15927, 5)

import tensorflow as tf

embedding_dim = 64
inputs = tf.keras.Input(shape=(max_seq_length,))

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=embedding_dim,
    input_length=max_seq_length
    )(inputs)

gru = tf.keras.layers.GRU(units=embedding_dim)(embedding)

outputs = tf.keras.layers.Dense(5, activation='sigmoid')(gru)


model = tf.keras.Model(inputs, outputs)


model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)

batch_size = 32
epochs = 3

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(),
        tf.keras.callbacks.ModelCheckpoint('model.h5', save_best_only=True)
    ]
)
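
The held-out test split is not evaluated in the post; a short sketch of how it could be used after training, assuming the variables above:

# Evaluate the trained model on the 30% test split; the return order matches compile() metrics.
test_loss, test_acc, test_auc = model.evaluate(X_test, y_test, batch_size=batch_size)
print(f'test accuracy: {test_acc:.3f}, test AUC: {test_auc:.3f}')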

So my question is: how do I add another input to the model? For example, if I want to add a column that represents the total time the user spent on the website, how do I feed it in alongside the embedding layer, given that it is not tokenized and is unrelated to the tokenized pagePath column?

1 Answer


You could tokenize the main row in the dataset, I guess, and then feed the model with the updated dataset. You could also try to fine-tune the validation split; increasing the number of epochs may also give better results.
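
To address the extra-input part of the question directly, here is a minimal sketch of one common approach with the Keras functional API: keep the embedding + GRU branch for pagePath and concatenate its output with a second numeric input. Names such as time_on_site are hypothetical, and scaling the numeric feature beforehand is an assumption:

import tensorflow as tf

# Branch 1: the tokenized pagePath sequence, as in the question.
seq_in = tf.keras.Input(shape=(max_seq_length,), name='pagePath')
x = tf.keras.layers.Embedding(input_dim=vocab_length, output_dim=64)(seq_in)
x = tf.keras.layers.GRU(units=64)(x)

# Branch 2: a single numeric feature, e.g. total time spent on the site
# (hypothetical column; scale it, for example with StandardScaler, before feeding it in).
time_in = tf.keras.Input(shape=(1,), name='time_on_site')

# Merge both branches and classify as before.
merged = tf.keras.layers.Concatenate()([x, time_in])
outputs = tf.keras.layers.Dense(5, activation='sigmoid')(merged)

model = tf.keras.Model(inputs=[seq_in, time_in], outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Training then takes a list (or dict) of inputs in the matching order:
# model.fit([X_train_seq, X_train_time], y_train, ...)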
