One option is to transform each string directly into a number by creating a meaningful one-to-one mapping.
In this case, preserving the order is doable and meaningful.
Here is a simple demo:
data = ['event1,event3,event2,event1', 'event2,event2', 'event1,event2,event3']

def mapper(data):
    result = []
    for d in data:
        events = d.replace(' ', '').split(',')
        v = 0
        for i, e in enumerate(events):
            # for each event: sum its character codes, then divide by a
            # position-dependent weight so that the order affects the result
            # the offset 100 is optional, it just keeps the numbers small
            v += sum(ord(c) for c in e) / (i + 100)
        result.append(v)
    return result

new_data = mapper(data)
print(new_data)
Output:
[23.480727373137086, 11.8609900990099, 17.70393127548049]
Although the probability of clashes is very low, there is no 100% guarantee that a gigantic dataset will be free of them.
Check this analysis:
# check for clashes on a huge dataset
import random as r
import matplotlib.pyplot as plt

r.seed(2020)

def ratio_of_clashes(max_events):
    MAX_DATA = 1000000
    events_pool = [','.join(['event' + str(r.randint(1, max_events))
                             for _ in range(r.randint(1, max_events))])
                   for _ in range(MAX_DATA)]
    # print(events_pool[0:10])  # print a few to see
    mapped_events = mapper(events_pool)
    # percentage of distinct strings that lost their distinct value
    return abs(len(set(mapped_events)) - len(set(events_pool))) / MAX_DATA * 100

n_samples = range(5, 100)
ratios = []
for i in n_samples:
    ratios.append(ratio_of_clashes(i))

plt.plot(n_samples, ratios)
plt.title('The Trend of Clashes with Change of Number of Events')
plt.show()

As a result, the fewer events or data you have, the lower the clash ratio; past some threshold it flattens out. All in all, that is not bad (personally, I can live with it).
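One way such a clash can arise is easy to reproduce by hand. Because the mapper only sums character codes, two event names made of the same characters in a different order (the hypothetical names 'event12' and 'event21' here) get the same value at the same position. A minimal sketch, reusing the mapper from the demo:

```python
def mapper(data):
    # same mapper as in the demo above
    result = []
    for d in data:
        events = d.replace(' ', '').split(',')
        v = 0
        for i, e in enumerate(events):
            v += sum(ord(c) for c in e) / (i + 100)
        result.append(v)
    return result

# 'event12' and 'event21' contain the same characters, so their sums match
clash_a = mapper(['event12,event3'])
clash_b = mapper(['event21,event3'])
print(clash_a == clash_b)

# swapping the order of two different events still changes the value
order_a = mapper(['event1,event2'])
order_b = mapper(['event2,event1'])
print(order_a == order_b)
```

So the order of events is preserved, but the characters inside a single event name are not, which matters once event numbers reach two digits.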
Update & Final Thoughts:
I just noticed that you are already using an LSTM, so the order matters a great deal. In this case, I strongly suggest you encode the events into integers and then build a time series, which fits the LSTM perfectly. Follow these steps:
- Pre-process each string and split them into events (as I did in the example).
- Fit LabelEncoder on them and transform them into integers.
- Scale the result into [0, 1] by fitting MinMaxScaler.
You will end up with something like this (note that LabelEncoder itself numbers the classes from 0, not 1):
'event1' : 1
'event2' : 2
'event3' : 3
.
.
.
'eventN' : N
For 'event1,event3,event2,event3', this becomes [1, 3, 2, 3].
Scaling --> [0, 1, 0.5, 1].
The LSTM is then more than capable of figuring out the order by its nature. And forget about the dimensionality point: the main job of an LSTM is to remember, and optionally forget, steps and the order of steps!
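The three steps above can be sketched roughly as follows. This is only an illustration on the toy data from the demo, assuming scikit-learn and NumPy are installed; the variable names are my own. Since LabelEncoder starts numbering at 0, the integers differ by one from the mapping shown above, but the scaled values come out the same way.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

data = ['event1,event3,event2,event1', 'event2,event2', 'event1,event2,event3']

# Step 1: split each string into its ordered list of events
sequences = [d.replace(' ', '').split(',') for d in data]

# Step 2: fit LabelEncoder on every event name, then encode each sequence
encoder = LabelEncoder()
encoder.fit([e for seq in sequences for e in seq])
encoded = [encoder.transform(seq) for seq in sequences]

# Step 3: fit MinMaxScaler on all labels, then scale each sequence into [0, 1]
scaler = MinMaxScaler()
scaler.fit(np.concatenate(encoded).reshape(-1, 1).astype(float))
scaled = [scaler.transform(seq.reshape(-1, 1).astype(float)).ravel()
          for seq in encoded]

for raw, seq in zip(data, scaled):
    print(raw, '->', seq.tolist())
```

The sequences still have different lengths here, so they would typically need padding (or masking) before being batched into the LSTM.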