
I would like to build a one-layer LSTM model with embeddings for my categorical features. I have numerical features and a few categorical features, such as Location, that can't be one-hot encoded (e.g. with pd.get_dummies(), which is what I originally intended to do) due to computational complexity.

Let's visualise an example:

Sample Data

import pandas as pd

data = {
    'user_id': [1,1,1,1,2,2,3],
    'time_on_page': [10,20,30,20,15,10,40],
    'location': ['London','New York', 'London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid'],
    'page_id': [5,4,2,1,6,8,2]
}
d = pd.DataFrame(data=data)
print(d)
   user_id  time_on_page   location  page_id
0        1            10     London        5
1        1            20   New York        4
2        1            30     London        2
3        1            20   New York        1
4        2            15  Hong Kong        6
5        2            10      Tokyo        8
6        3            40     Madrid        2

Let's look at a person visiting a website. I'm tracking numerical data such as time on page, among others. Categorical data includes: Location (over 1000 unique values), Page_id (> 1000 unique values), Author_id (100+ unique values). The simplest solution would be to one-hot encode everything and feed it into an LSTM with variable sequence lengths, each timestep corresponding to a different page view.

The above DataFrame will generate 7 training samples, with variable sequence lengths. For example, for user_id=2 I will have 2 training samples:

[ ROW_INDEX_4 ] and [ ROW_INDEX_4, ROW_INDEX_5 ]
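For concreteness, a minimal sketch of how these prefix samples could be generated from the DataFrame above (the helper name build_sequences is mine, not a library function):

def build_sequences(df, feature_cols):
    """One training sample per prefix of each user's page-view history."""
    samples = []
    for _, user_df in df.groupby('user_id'):
        rows = user_df[feature_cols].to_numpy()
        # one sample per prefix: [row_0], [row_0, row_1], ...
        for t in range(1, len(rows) + 1):
            samples.append(rows[:t])
    return samples

X = build_sequences(d, ['time_on_page', 'location', 'page_id'])
print(len(X))  # 7 samples, matching the 7 rows above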

Let X be the training data, and let's look at the first training sample X[0].

[image: the training sample X[0], with numerical features in the first n columns and categorical features in columns n onward]

From the picture above, my categorical features are X[0][:, n:].

Before creating sequences, I factorized the categorical variables into [0, 1, ..., number_of_cats - 1] using pd.factorize(), so the data in X[0][:, n:] are integers corresponding to their category indices.
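For example, for the location column:

codes, uniques = pd.factorize(d['location'])
print(codes)          # [0 1 0 1 2 3 4]
print(list(uniques))  # ['London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid']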

Do I need to create an Embedding for each of the categorical features separately? E.g. one embedding for each of x_n, x_(n+1), ..., x_m?

If so, how do I put this into Keras code?

model = Sequential()

model.add(Embedding(?, ?, input_length=variable)) # How do I feed the data into this embedding? Only the categorical inputs.

model.add(LSTM())
model.add(Dense())
model.add(Activation('sigmoid'))
model.compile()

model.fit_generator() # fits the variable-length sequences `X[i]` one by one.

My solution idea:

Something that looks like:

[image: proposed architecture, with per-feature embeddings concatenated with the numerical features and fed into the LSTM]

I can train a Word2Vec model on every single categorical feature (m-n of them) to vectorise any given value. E.g. London will be vectorised into 3 dimensions. Let's suppose I use 3-dimensional embeddings. Then I will put everything back into the X matrix, which will now have n + 3(m-n) columns, and use the LSTM model to train on it?
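A rough sketch of that idea with gensim, using the raw location strings (assuming gensim 4+, where the parameter is vector_size; older versions call it size):

from gensim.models import Word2Vec

# treat each user's sequence of locations as a "sentence"
location_sentences = d.groupby('user_id')['location'].apply(list).tolist()
w2v = Word2Vec(location_sentences, vector_size=3, window=2, min_count=1)
print(w2v.wv['London'])  # a 3-dimensional vector for 'London'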

I just think there should be an easier/smarter way.


2 Answers


One solution, as you mentioned, is to one-hot encode the categorical data (or even use them as they are, in index-based format) and feed them, along with the numerical data, to an LSTM layer. Of course, you could also have two LSTM layers here, one for processing the numerical data and another for processing the categorical data (in one-hot encoded or index-based format), and then merge their outputs.

Another solution is to have one separate embedding layer for each of those categorical features. Each embedding layer may have its own embedding dimension (and, as suggested above, you may have more than one LSTM layer for processing the numerical and categorical features separately):

from keras.layers import Input, Embedding, LSTM, TimeDistributed, Reshape, concatenate
from keras.models import Model

num_cats = 3                 # number of categorical features
n_steps = 100                # number of timesteps in each sample
n_numerical_feats = 10       # number of numerical features in each sample
cat_size = [1000, 500, 100]  # number of categories in each categorical feature
cat_embd_dim = [50, 10, 100] # embedding dimension for each categorical feature

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,1), name='cat' + str(i+1) + '_input'))

cat_embedded = []
for i in range(num_cats):
    embed = TimeDistributed(Embedding(cat_size[i], cat_embd_dim[i]))(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
cat_merged = Reshape((n_steps, -1))(cat_merged)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)
model.summary()

Here is the model summary:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
cat1_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
cat2_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
cat3_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 100, 1, 50)   50000       cat1_input[0][0]                 
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 100, 1, 10)   5000        cat2_input[0][0]                 
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 100, 1, 100)  10000       cat3_input[0][0]                 
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100, 1, 160)  0           time_distributed_1[0][0]         
                                                                 time_distributed_2[0][0]         
                                                                 time_distributed_3[0][0]         
__________________________________________________________________________________________________
numeric_input (InputLayer)      (None, 100, 10)      0                                            
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 100, 160)     0           concatenate_1[0][0]              
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 100, 170)     0           numeric_input[0][0]              
                                                                 reshape_1[0][0]                  
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 64)           60160       concatenate_2[0][0]              
==================================================================================================
Total params: 125,160
Trainable params: 125,160
Non-trainable params: 0
__________________________________________________________________________________________________

Yet another solution you can try is to have just one embedding layer for all the categorical features. It involves some preprocessing though: you need to re-index all the categories to make them distinct from each other. For example, the categories in the first categorical feature would be numbered from 1 to size_first_cat, then the categories in the second categorical feature would be numbered from size_first_cat + 1 to size_first_cat + size_second_cat, and so on. However, in this solution all the categorical features would have the same embedding dimension, since we are using only one embedding layer.
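A rough sketch of that re-indexing trick (variable names are mine; I use 0-based offsets so that the single Embedding layer covers all total_cats indices):

import numpy as np
from keras.layers import Input, Embedding, Reshape, LSTM, concatenate
from keras.models import Model

offsets = np.cumsum([0] + cat_size[:-1])  # [0, 1000, 1500]
total_cats = sum(cat_size)                # 1600
shared_embd_dim = 50                      # one dimension for all features

# preprocessing: X_cat has shape (samples, n_steps, num_cats) with per-feature
# codes; adding the offsets makes every category id globally unique:
# X_cat = X_cat + offsets

numerical_input = Input(shape=(n_steps, n_numerical_feats))
cat_input = Input(shape=(n_steps, num_cats))
embedded = Embedding(total_cats, shared_embd_dim)(cat_input)  # (n_steps, num_cats, 50)
embedded = Reshape((n_steps, num_cats * shared_embd_dim))(embedded)
merged = concatenate([numerical_input, embedded])
lstm_out = LSTM(64)(merged)
model = Model([numerical_input, cat_input], lstm_out)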


Update: Now that I think about it, you can also reshape the categorical features in the data preprocessing stage, or even in the model, to get rid of the TimeDistributed layers and the Reshape layer (and this may increase the training speed as well):

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,), name='cat' + str(i+1) + '_input'))

cat_embedded = []
for i in range(num_cats):
    embed = Embedding(cat_size[i], cat_embd_dim[i])(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)

As for fitting the model, you need to feed each input layer separately with its own corresponding numpy array, for example:

X_tr_numerical = X_train[:,:,:n_numerical_feats]

# extract the categorical features: you could use a for loop to do this as well.
# note that we reshape the categorical features to make them consistent with the updated solution
X_tr_cat1 = X_train[:,:,cat1_idx].reshape(-1, n_steps) 
X_tr_cat2 = X_train[:,:,cat2_idx].reshape(-1, n_steps)
X_tr_cat3 = X_train[:,:,cat3_idx].reshape(-1, n_steps)

# don't forget to compile the model ...

# fit the model
model.fit([X_tr_numerical, X_tr_cat1, X_tr_cat2, X_tr_cat3], y_train, ...)

# or you can use input layer names instead
model.fit({'numeric_input': X_tr_numerical,
           'cat1_input': X_tr_cat1,
           'cat2_input': X_tr_cat2,
           'cat3_input': X_tr_cat3}, y_train, ...)

If you would like to use fit_generator(), there is no difference:

# if you are using a generator
def my_generator(...):

    # prep the data ...

    yield [batch_tr_numerical, batch_tr_cat1, batch_tr_cat2, batch_tr_cat3], batch_tr_y

    # or use the names
    yield {'numeric_input': batch_tr_numerical,
           'cat1_input': batch_tr_cat1,
           'cat2_input': batch_tr_cat2,
           'cat3_input': batch_tr_cat3}, batch_tr_y

model.fit_generator(my_generator(...), ...)

# or if you are subclassing Sequence class
class MySequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        # initialize the data

    def __len__(self):
        # return the number of batches

    def __getitem__(self, idx):
        # fetch data for the given batch index (i.e. idx)

        # same as the generator above but use `return` instead of `yield`

model.fit_generator(MySequence(...), ...)
  • Thank you very much. One more question about fitting and training the model. Can I use variable sequence lengths in batches? E.g. a batch of dimension (batch_size, variable_length, n_numerical_feats + num_cats)? Do I simply pass this as X, and the model will know that `X[0][:,:,:n_numerical_feats]` should go to the LSTM directly and `X[0][:,:,n_numerical_feats:]` should go through their corresponding embeddings? E.g. I'm planning on having 3 embeddings + LSTMs, one per categorical variable, and then merging them with the numerical LSTM – GRS Oct 03 '18 at 15:11
  • I also attached a picture of what I have in my head. Also, could you tell me why we use `TimeDistributed`? – GRS Oct 03 '18 at 15:23
  • @GRS If you don't want to use `TimeDistributed` layer, you need to reshape the categorical inputs from `(n_steps,1)` to `(n_steps)` (either in the model or in data preprocessing stage). Actually, I think this way is better as well, therefore I have updated my answer to reflect this and also answer the other question you have about fitting. As for training with variable length, you can do it but the dimensions of each input batch must be determined (i.e. they cannot be `None`). Therefore you can either train with `batch_size=1` or pad the input samples (e.g. with 0) to make their length fixed. – today Oct 03 '18 at 15:58
  • Thanks, this helped a lot, one last thing, I'm a bit confused if I can use `fit_generator()`. I usually pass a Sequence class, where the generator returns a tuple of (batch_size, variable_sequence_length (1 to t), number_of_features), but in this case, I'm assuming no generator is possible? – GRS Oct 03 '18 at 16:06
  • @GRS Of course you can use it. I have updated my answer again (look at the end). – today Oct 03 '18 at 16:12
  • Perfect, thanks. I can just `return { X_dictionary }, y, sample_weights` as a tuple in the `__getitem__` method :) – GRS Oct 03 '18 at 16:14
  • @GRS I added that as well. Thanks for mentioning it. – today Oct 03 '18 at 16:22
  • Btw I tried to input `None` as number of time steps to allow a flexible number of time steps and the first model doesn't build. However everything is fine with the 2nd one. I just thought I'd let you know, you might find it interesting – GRS Oct 03 '18 at 16:28
  • @GRS That's because of the `Reshape` layer in the first solution. The given dimensions to this layer cannot be `None` (i.e. `n_steps`). You need to change it to this: `Reshape((-1, 160))(cat_merged)` or generally `Reshape((-1, sum_of_cat_embd_dim))(cat_merged)`. – today Oct 03 '18 at 16:36
  • Thanks again. The 2nd solution seems more elegant and robust for sure, and the dimensions all make sense to me – GRS Oct 03 '18 at 16:43
  • @today Is it possible for you to provide a simpler example for the question here? https://stackoverflow.com/questions/51469446/keras-and-error-setting-an-array-element-with-a-sequence – jlewkovich Mar 19 '19 at 04:49
  • @today How should we adjust the code if categorical variables are one-hot-encoded and each has different lengths? – Arwen Sep 24 '20 at 05:06
  • @Arwen Do you mean `n_steps` is different for each categorical feature? Then you should either pad/truncate them to the same length, or instead use a separate `LSTM` layer for processing each of those and then concatenate the outputs of those LSTM layers. – today Sep 24 '20 at 07:02
  • Actually I mean: what if each categorical variable is one-hot encoded and, for example, the first variable has 18 categories, the second has 20 categories, etc.? The number of steps is the same for all. @today – Arwen Sep 24 '20 at 15:39
  • @Arwen Oh, then see the `cat_size` variable definition in the answer. – today Sep 24 '20 at 19:37
  • @today First, thank you for this good answer. In your last **update**, do you assume that the categorical variables (e.g. `X_tr_cat1`, `X_tr_cat2`, `X_tr_cat3`) have already been converted to indices? (Otherwise we would have them one-hot encoded, as there were text features like `location` in them.) – A.B Oct 21 '20 at 08:48
  • @A.B Yes, they should be encoded as integer indices since they are passed to `Embedding` layers. – today Oct 21 '20 at 09:13
  • Thank you, that makes sense. (BTW, I was looking for a LinkedIn or other profile for any collaboration or to follow you; is it possible to have it on your profile? :) ) – A.B Oct 21 '20 at 09:17
  • @A.B I don't have a linkedin profile (nor a social media account), but feel free to get in touch using the email in my profile page. – today Oct 21 '20 at 09:24
  • Thank you for the reply, sure I will :) – A.B Oct 21 '20 at 09:25

One other solution I can think of is that you could concatenate the numerical features (after normalizing them) and the categorical features together even before you feed them to the LSTM.

During backpropagation, allow the gradients to flow only into the embedding layer, since by default the gradients will flow into both branches.
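A minimal sketch of this idea (shapes and variable names are assumed for illustration; the Embedding is the only layer with trainable weights before the LSTM, so the gradients at that point only update the embedding matrix):

from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

n_steps = 100
n_numerical_feats = 10

numerical_input = Input(shape=(n_steps, n_numerical_feats))  # already normalized
cat_input = Input(shape=(n_steps,))                          # integer category ids
cat_embedded = Embedding(1000, 50)(cat_input)                # (n_steps, 50)

# concatenate numerical and embedded categorical features before the LSTM
merged = concatenate([numerical_input, cat_embedded])
output = Dense(1, activation='sigmoid')(LSTM(64)(merged))
model = Model([numerical_input, cat_input], output)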
