The problem I'm trying to solve is to find the support of each itemset in transactional data.
For example,
transactions = [
'b c d',
'a g',
'a c d e',
'e f h',
'a b c g h',
'd',
'a e g h',
'b c d',
'a b f g h',
'a c d g',
]
should produce the support counts [2, 5, 1, 1, 1, 5, 1, 2, 1, 1].
So basically, for the second transaction 'a g': it is a subset of the transactions 'a g' (itself), 'a b c g h', 'a e g h', 'a b f g h', and 'a c d g', and hence its count is 5.
Initially, I converted this dataset to a one-hot encoded DataFrame using mlxtend's TransactionEncoder, and used something like
df.progress_apply(lambda x: (df.iloc[:, np.where(x==1)[0]].sum(1)==len(np.where(x==1)[0])).sum(), axis=1)
to get the values.
The idea is to slice the matrix/DataFrame down to the columns of the items in the current row and then sum across rows; any row whose sum equals the number of items in the current row contains it as a subset, so I count those rows.
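For reference, here is a minimal runnable sketch of that dense approach (plain apply instead of progress_apply, so it doesn't need tqdm; the variable names are just illustrative):

import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

transactions = [
    'b c d', 'a g', 'a c d e', 'e f h', 'a b c g h',
    'd', 'a e g h', 'b c d', 'a b f g h', 'a c d g',
]

# one-hot encode: one boolean column per item
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform([t.split() for t in transactions]),
                  columns=te.columns_)

def row_support(row):
    cols = np.where(row)[0]  # column indices of this row's items
    # a row is a superset iff its sum over these columns equals len(cols)
    return int((df.iloc[:, cols].sum(axis=1) == len(cols)).sum())

print(df.apply(row_support, axis=1).tolist())  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]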
This worked fine for smaller datasets, but when I came across the kosarak dataset I could no longer build a dense representation without an OOM error. So I switched to CountVectorizer, generated a sparse representation, and applied the same logic as before.
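A sketch of that sparse variant, reusing the transactions list from the snippet above (note: CountVectorizer's default token_pattern drops single-character tokens, so for this toy data I pass analyzer=str.split; binary=True keeps the matrix 0/1):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer=str.split, binary=True)
X = vec.fit_transform(transactions)  # sparse CSR, one row per transaction

counts = [
    # X[i].indices are the item columns of row i; X[i].nnz is its item count
    int((X[:, X[i].indices].sum(axis=1) == X[i].nnz).sum())
    for i in range(X.shape[0])
]
print(counts)  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]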
Now the issue is that summing over the scipy sparse matrix is about 4x slower than summing over the dense one, with a run time of
164 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even using sets to solve the problem didn't improve things much.
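For completeness, the set-based version I tried looks roughly like this; it is still a quadratic number of pairwise subset tests:

# frozenset <= frozenset is Python's subset test
sets = [frozenset(t.split()) for t in transactions]
counts = [sum(s <= other for other in sets) for s in sets]
print(counts)  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]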
So far, this has been my approach, and I believe it has O(n²) complexity. Is there a better algorithm or package to speed things up?
Any help is appreciated. Thanks in advance.