
The following is one column of a dataset that I'm trying to feature engineer:

+---+-----------------------------+
|Id |events_list                  |
+---+-----------------------------+
|1  |event1,event3,event2,event1  |
+---+-----------------------------+
|2  |event3,event2                |
+---+-----------------------------+

There are 3 possible event types, and the order in which they arrived is saved as a string. I've transformed the events column like so:

+---+------+------+------+
|Id |event1|event2|event3|
+---+------+------+------+
|1  |2     |1     |1     |
+---+------+------+------+
|2  |0     |1     |1     |
+---+------+------+------+

This preserves the count information but loses the order information.
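
For reference, a minimal sketch of how such a count transformation might be done with pandas (the DataFrame construction and column names are illustrative):

import pandas as pd

df = pd.DataFrame({'Id': [1, 2],
                   'events_list': ['event1,event3,event2,event1',
                                   'event3,event2']})

# split each string into events, then count occurrences per Id
counts = (df.set_index('Id')['events_list']
            .str.split(',')
            .explode()
            .groupby(level=0)
            .value_counts()
            .unstack(fill_value=0))
print(counts)  # columns: event1, event2, event3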

Q: is there a way to encode the order as a feature?

Update: for each row of events I calculate a score for that day, and the model should predict the future score for new daily events. In any case, both the order and the count of events affect the daily score.

Update: my dataset contains other daily information, such as session counts, and my current model is an LSTM digesting each row by date. I want to try to improve my prediction by adding the order info to the existing model.

  • Thanks for your reply, please see my updates – Shlomi Schwartz Apr 18 '20 at 12:50
  • Have a look at https://medium.com/@Nithanaroy/encoding-fixed-length-high-cardinality-non-numeric-columns-for-a-ml-algorithm-b1c910cb4e6d there are some good examples, I would go with hash embedding. – Roni Gadot Apr 19 '20 at 06:24

2 Answers


One option is to translate/transform the string directly by creating a meaningful one-to-one mapping. In this case, preserving the order is doable and meaningful.

Here is a simple demo:

data = ['event1,event3,event2,event1', 'event2,event2', 'event1,event2,event3']

def mapper(data):
    result = []
    for d in data:
        events = d.replace(' ', '').split(',')
        v = 0
        for i, e in enumerate(events):
            # for each string: get the sum of char values,
            # normalized by their orders
            # here 100 is optional, just to make the number small
            v += sum(ord(c) for c in e) / (i + 100) 
        result.append(v)
    return result

new_data = mapper(data)
print(new_data)

Output:

[23.480727373137086, 11.8609900990099, 17.70393127548049]

Although the probability of clashes is very low, there is no 100% guarantee that a gigantic dataset will be completely free of clashes.

Check this analysis:

# check for clashes on huge dataset
import random as r
import matplotlib.pyplot as plt

r.seed(2020)

def ratio_of_clashes(max_events):
    MAX_DATA = 1000000
    events_pool = [','.join(['event' + str(r.randint(1, max_events))
                         for _ in range(r.randint(1, max_events))])
                   for _ in range(MAX_DATA)]
    # print(events_pool[0:10])  # print few to see
    mapped_events = mapper(events_pool)
    return abs(len(set(mapped_events)) - len(set(events_pool))) / MAX_DATA * 100


n_samples = range(5, 100)
ratios = []
for i in n_samples:
    ratios.append(ratio_of_clashes(i))

plt.plot(n_samples, ratios)
plt.title('The Trend of Clashes with Change of Number of Events')
plt.show()

[Plot: the trend of clashes as the number of possible events changes]

As a result, the fewer events or data points you have, the lower the clash ratio, until it hits some threshold and flattens out. Even then, the ratio is not bad at all (personally, I can live with it).


Update & Final Thoughts:

I just noticed that you are already using an LSTM, so the order matters a great deal. In that case, I strongly suggest you encode the events as integers and build a time series that fits naturally into the LSTM. Follow these steps (a code sketch follows the example mapping below):

  1. Pre-process each string and split them into events (as I did in the example).
  2. Fit LabelEncoder on them and transform them into integers.
  3. Scale the result into [0 - 1] by fitting MinMaxScaler.

You will end up with something like this:

'event1' : 1
'event2' : 2
'event3' : 3
...
'eventN' : N

and 'event1,event3,event2,event3' becomes [1, 3, 2, 3], which scales to [0, 1, 0.5, 1].
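
A minimal sketch of steps 1-3 with scikit-learn (note that LabelEncoder assigns 0-based codes, so the exact integers differ slightly from the mapping above, but the order information is preserved in the same way):

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
import numpy as np

data = ['event1,event3,event2,event1', 'event3,event2']

# 1. split each string into its events
sequences = [s.replace(' ', '').split(',') for s in data]

# 2. fit a LabelEncoder on all events and transform each sequence to integers
le = LabelEncoder()
le.fit([e for seq in sequences for e in seq])
encoded = [le.transform(seq) for seq in sequences]

# 3. scale the integer codes into [0, 1]
scaler = MinMaxScaler()
scaler.fit(np.concatenate(encoded).reshape(-1, 1))
scaled = [scaler.transform(seq.reshape(-1, 1)).ravel() for seq in sequences]

print(scaled)  # [array([0. , 1. , 0.5, 0. ]), array([1. , 0.5])]

Sequences of different lengths would still need padding (or bucketing) before being fed to the LSTM.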

The LSTM is then more than capable of figuring out the order on its own. And don't worry about the dimensionality point: an LSTM's main job is to remember, and optionally forget, steps and the order of steps.

Yahya
  • Thanks for the reply, so does that mean that `'event1,event2' == 'event2,event1'` because their sums are the same? – Shlomi Schwartz Apr 20 '20 at 09:12
  • @ShlomiSchwartz You're welcome Shlomi. Regarding your question: No, they will be different for sure (11.850, 11.851). However, we are assuming that events differ in one single character (which is not very realistic, though the results are unique), but in reality, I reckon your events have real different names, right? – Yahya Apr 20 '20 at 10:24
  • Very clever idea! If I get it correctly, the "magic" is dividing the sum by the index of the event, right? – Shlomi Schwartz Apr 20 '20 at 10:55
  • @ShlomiSchwartz Exactly, the sum is just to get *relatively* unique representation of the event, which is later *normalized* by the order of the event in that sequence. As a result, each sequence will be transformed into a unique numerical value based on the events, their number and their order. One last thing, please don't forget to normalize/standardize your data at the end before throwing them to LSTM (as a general approach). – Yahya Apr 20 '20 at 11:37
  • Normalize at the end for sure, but do you think I should maybe normalize the values returned by the mapper function according to their length? That way a long list would not contain the "amount" information but just the order. For example, dividing by the length of the list: `sum(ord(c) for c in e) / (i + 100) / len(events)` – Shlomi Schwartz Apr 20 '20 at 12:10
  • 1
    @ShlomiSchwartz The *amount* of information is already included. A sequence of 10 events for example is mapped into larger value than a sequence of 3 events. However, I see your points in case the lengths of two sequences are very close to each other (like 9 and 10) , in this case, if you want to normalize by the number of *events* per sequence (which is a very valid point indeed), you then need to divide the *sequence* by `/ len(events)` at the end like this: `result.append(v / len(events))`. I this case, sequences are normalized by the amount of information per one. – Yahya Apr 20 '20 at 12:18

One possibility is a series of vectors representing the events that have occurred up to the nth event, where n is the maximum number of events that can occur and the vector length is the number of possible events. This implicitly encodes the order of the events in a fixed-size feature space.

+---+-----------------------------+
|Id |events_list                  |
+---+-----------------------------+
|1  |event1,event3,event2,event1  |
+---+-----------------------------+
|2  |event3,event2                |
+---+-----------------------------+



+---+--------------------------------------------------+
|Id | events_1  events_2  events_3  events_4  events_5 |
+---+--------------------------------------------------+
|1  | [1,0,0]   [1,0,1]   [1,1,1]   [2,1,1]   [2,1,1]  |
+---+--------------------------------------------------+
|2  | [0,0,1]   [0,1,1]   [0,1,1]   [0,1,1]   [0,1,1]  |
+---+--------------------------------------------------+
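
A minimal sketch of this cumulative-count encoding, assuming 3 possible event types and padding to a fixed number of steps (here 5; both constants are illustrative):

EVENT_TYPES = ['event1', 'event2', 'event3']
MAX_STEPS = 5

def cumulative_vectors(events_list, max_steps=MAX_STEPS):
    events = events_list.split(',')
    counts = [0] * len(EVENT_TYPES)
    vectors = []
    for step in range(max_steps):
        if step < len(events):
            counts[EVENT_TYPES.index(events[step])] += 1
        vectors.append(list(counts))  # snapshot of the running counts at this step
    return vectors

print(cumulative_vectors('event1,event3,event2,event1'))
# [[1, 0, 0], [1, 0, 1], [1, 1, 1], [2, 1, 1], [2, 1, 1]]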

OR

Fewer feature dimensions with the same information: record which event occurred at each event step n.

+---+--------------------------------------------------+
|Id | event_1  event_2  event_3  event_4  event_5      |
+---+--------------------------------------------------+
|1  |   1         3        2        1        0         |
+---+--------------------------------------------------+
|2  |   3         2        0        0        0         |
+---+--------------------------------------------------+
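
A minimal sketch of this per-step encoding, again with illustrative constants: the 1-based index of the event at each step, zero-padded to a fixed length:

EVENT_TYPES = ['event1', 'event2', 'event3']
MAX_STEPS = 5

def step_encoding(events_list, max_steps=MAX_STEPS):
    events = events_list.split(',')
    codes = [EVENT_TYPES.index(e) + 1 for e in events[:max_steps]]
    return codes + [0] * (max_steps - len(codes))  # pad with 0 = "no event"

print(step_encoding('event1,event3,event2,event1'))  # [1, 3, 2, 1, 0]
print(step_encoding('event3,event2'))                # [3, 2, 0, 0, 0]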

This has fewer dimensions, which is good, but has the possible disadvantage of not explicitly encoding the final state. Not knowing anything about the problem itself or which model you plan to use, it's difficult to know whether that will matter.

ezekiel
  • Thank you for your reply; unfortunately, with that approach, if one row has 1000 events while the other rows have only 2 events, the column count would be 1000 for all rows. – Shlomi Schwartz Apr 18 '20 at 13:17
  • Yeah, that's true, but that requirement of a fixed-size feature space for every row probably exists, depending on your model? Unless you are using a recurrent neural network or something? But if that were the case, I doubt you'd be asking a question like this. Give us a lot more specific info about the problem, the constraints, and the model you want to use, and maybe I can give an example more particular to your situation. – ezekiel Apr 18 '20 at 13:25