6

The problem I'm trying to solve is to find the support of each itemset in transactional data.

For example,

transactions = [
    'b c d',
    'a g',
    'a c d e',
    'e f h',
    'a b c g h',
    'd',
    'a e g h',
    'b c d',
    'a b f g h',
    'a c d g',
]

will have the supports [2, 5, 1, 1, 1, 5, 1, 2, 1, 1].

So basically, the second transaction 'a g' is a subset of 5 transactions: 'a g', 'a b c g h', 'a e g h', 'a b f g h' and 'a c d g', and hence its count is 5.

Initially, I converted this dataset into a one-hot encoded transaction matrix using mlxtend's TransactionEncoder and used something like

df.progress_apply(lambda x: (df.iloc[:, np.where(x==1)[0]].sum(1)==len(np.where(x==1)[0])).sum(), axis=1)

to get the values.

The idea is to slice the matrix/DataFrame down to the columns corresponding to the current row's items and sum along each row; any row whose sum equals the number of items in the current row contains it as a subset, so it gets counted.
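
Roughly, the dense pipeline looked like this (a minimal sketch; plain apply is shown instead of tqdm's progress_apply):

import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

items = [t.split() for t in transactions]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(items).transform(items), columns=te.columns_)

# slice to the current row's columns, sum per row, count the rows that contain them all
support = df.apply(
    lambda x: (df.iloc[:, np.where(x == 1)[0]].sum(1) == len(np.where(x == 1)[0])).sum(),
    axis=1,
)
print(support.tolist())  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]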

However, this worked fine for smaller datasets; when I came to the kosarak dataset, I could not keep a dense representation because of an OOM error. So I switched to CountVectorizer, generated a sparse representation, and applied the same logic as before.
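
A rough sketch of that sparse variant (illustrative only; token_pattern is relaxed here so single-character or numeric items are not dropped):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True, token_pattern=r'\S+')  # keep 1-char items
X = vec.fit_transform(transactions)                       # CSR matrix, one row per transaction

support = []
for i in range(X.shape[0]):
    cols = X.indices[X.indptr[i]:X.indptr[i + 1]]  # columns present in transaction i
    # rows whose slice over these columns sums to len(cols) contain transaction i
    support.append(int((X[:, cols].sum(axis=1) == len(cols)).sum()))
print(support)  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]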

Now the issue is that summing over the scipy sparse matrix is about 4x slower than over the dense one, with a run time of

164 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Even using sets to solve the problem didn't improve things much.

So far this has been my approach, and I believe it has O(n²) complexity. Is there a better algorithm/package to speed things up?

Any help is appreciated. Thanks in advance.

lu5er
  • The timing is for the example ? – Dani Mesejo Dec 02 '20 at 09:10
  • So I have looked up the kosarak dataset and I have a question: Does the order of the transactions in each line matter? (e.g. is 3, 5 equivalent to 5, 3). From your usage of sets I would say 'no' is the answer, but I can't be sure. – Shamis Dec 02 '20 at 13:21
  • And another question - some of the lines do repeat. What should be the approach to the duplicities? (Ignoring them is perfectly possible and I'm not sure if it's worth the extra comparison to cache/delete the results.) – Shamis Dec 02 '20 at 13:46
  • Could you try a divide and conquer approach? sort by length, count the duplicates, check only against longer strings, memoize the results (I mean, if `l9` is a subset of `l11`, then if `l5` is a subset of `l9` then it's also a subset of `l11`). – gboffi Dec 03 '20 at 17:02

4 Answers

1

Since 2**26 is well below the integer limit on 32-bit integers, you can do this:

import string
import numpy as np

digitize = lambda x: np.in1d(list(string.ascii_lowercase), x.split()) @ 2 ** np.arange(26)

digitize converts each string of letters into an integer whose bits mark which letters are present. Since each transaction is now a bitmask, subset checks reduce to bit arithmetic.

trans = np.array([digitize(t) for t in transactions])

Out[]: array([ 14,  65,  29, 176, 199,   8, 209,  14, 227,  77], dtype=int32)

(np.bitwise_and.outer(trans, trans) == trans).sum(0)  # bitwise definition of subset, summed over entries

Out[]: array([2, 5, 1, 1, 1, 5, 1, 2, 1, 1])

You could easily make trans a column of a DataFrame and then apply the bitwise test row by row to get your desired output. It should also reduce memory usage, since you're not storing those big one-hot vectors.
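
For example, a minimal sketch (assuming pandas) of applying that check row by row instead of materialising the full outer product:

import pandas as pd

df = pd.DataFrame({'trans': trans})
# for each bitmask t, count the rows whose mask contains all of t's bits
df['support'] = df['trans'].apply(lambda t: int(((trans & t) == t).sum()))
df['support'].tolist()  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]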

Daniel F
  • *facepalm* the problem is `np.bitwise_and.outer(tr, tr)` will end up huge. Well I'll leave it anyway. – Daniel F Dec 02 '20 at 10:30
1

You can use numba to speed up your computation over a dense array. Start with the one-hot encoding.

import numba
import numpy as np

transactions = list(map(str.split, transactions))
# First compute indices for each element of transactions
indices = dict(item[::-1] for item in enumerate(set(x for t in transactions for x in t)))
# Make the matrix len(transactions)-by-len(indices)
A = np.zeros((len(transactions), len(indices)), bool)
# Fill in the matrix
for r, item in enumerate(transactions):
    A[r, [indices[t] for t in item]] = True

Now, for each row, you can multiply it element-wise with A, sum along the columns, and count the rows whose overlap is at least the size of the current row:

@numba.njit
def check_subsets(A):
    output = np.empty(len(A), dtype=np.int_)
    for r, row in enumerate(A):
        output[r] = np.sum((A * row).sum(1) >= row.sum())
    return output

>>> check_subsets(A)
array([2, 5, 1, 1, 1, 5, 1, 2, 1, 1])

This does make things faster for your original dataset:

%timeit check_subsets(A)
5.2 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit (np.matmul(A, A.T, dtype=int) >= A.sum(1, keepdims=True)).sum(1)
7.86 µs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The improvement does not hold for large datasets, but the memory footprint is much smaller.

np.random.seed(42)
B = np.random.randint(2, size=(10000, 26), dtype=bool)
%timeit check_subsets(B)
4.48 s ± 62.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (np.matmul(B, B.T, dtype=int) >= B.sum(1, keepdims=True)).sum(1)
2.31 s ± 96.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Both benchmarks were run after the initial compilation phase of numba's JIT compiler. Note that the naive implementation without buffers is approximately 5% slower than the one shown.

Mad Physicist
0

My small attempt

If your current approach is ~164 ms per loop, this one is roughly 8x as efficient. Unfortunately, I can't lay claim to anything ingenious and I'm afraid it's still too slow. I just pre-build all the sets, then run the check in the most straightforward manner with issubset, as @solid.py does. Building the sets beforehand and simply using a for-loop instead of a function call made about a 6x difference.

The current timing for checking one set is ~22 ms ± 2 ms or thereabouts. I've been testing directly on the kosarak dataset, so I hope there is only one dataset with that name.

I have tried a few "smarter" ways to rule out implausible candidates early; unfortunately, all of them ended up slower than this "stupid", direct approach.

A few approaches that might actually be useful:

  • Sort the sets by size, then compute matches only against sets of >= length; the length check is the first thing .issubset does anyway.
    Since the first ~30,000 sets contain a single item and another ~35,000 contain two items, this could remove ~30% of the computation. Maybe more, since the small sets could be cached for a further improvement.

  • This leads to caching of the results - at least for the short sets. Creating a nested 1:{2:{}} structure is rather cheap and lets you reuse results. Using it even on the unsorted values gave a ~1.5 ms improvement. That is not much, but it could be more combined with the sorting. You could also stop caching once the sets get bigger (and the probability of finding a cached result shrinks).
    Generally, there are several transactions that repeat several hundred if not thousands of times, so this would help cut them down, further lessening the n in O(n^2). Unfortunately, I don't have anything that lowers the complexity itself.

  • Expanding on the caching - sorting and counting the sets beforehand could also be used to replace each set with a (set, count) tuple. This would remove the need for the caching above altogether and would cut out most of the unnecessary computation; a sketch of this idea follows the code below.

import csv
import time

# Pre-build all the transactions as sets of ints
reader = csv.reader(open('kosarak.csv'), delimiter=' ')
dataLines = []
for line in reader:
    dataLines.append(set(map(int, line)))

# Straightforward O(n^2) pass: count the supersets of each transaction
results = []
count = 0
totalTime = 0
for line1 in dataLines:
    r1 = 0
    t1 = time.time_ns()
    for line2 in dataLines:
        if line1.issubset(line2):
            r1 += 1

    t2 = time.time_ns()
    results.append(r1)
    totalTime += (t2 - t1) / 1000000  # ns -> ms
    count += 1
    if (count % 100) == 0:
        print("$$$$$$$$$$$$$")
        print(totalTime)
        print(totalTime / count)
        print(count)
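
A sketch of the sort-and-count idea from the bullets above (untested on kosarak, just to illustrate; it reuses the dataLines built above):

from collections import Counter

# collapse duplicate transactions and remember how often each occurs
counts = Counter(frozenset(line) for line in dataLines)
unique_sets = sorted(counts, key=len)  # ascending by size

support = {}
for i, s in enumerate(unique_sets):
    # s can only be a subset of sets at least as large; after deduplication
    # all of those sit at index >= i
    support[s] = sum(counts[t] for t in unique_sets[i:] if s.issubset(t))

results = [support[frozenset(line)] for line in dataLines]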
    
Shamis
0

Where possible, Python's set arithmetic is usually pretty decent and does not involve any convoluted binarizing logic, which is arguably harder to read/understand.

Just a suggestion to build upon:

transactions = [
    'b c d',
    'a g' ,
    'a c d e',
    'e f h',
    'a b c g h',
    'd', 
    'a e g h',
    'b c d',
    'a b f g h',
    'a c d g',
]
transactions = list(map(lambda x: x.replace(' ', ''), transactions))
print(transactions) # ['bcd', 'ag', 'acde', 'efh', 'abcgh', 'd', 'aegh', 'bcd', 'abfgh', 'acdg']

transactions_set = list(map(set, transactions))
counts = [sum(set(elem).issubset(s) for s in transactions_set) for elem in transactions]
print(counts) # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]
Stefan B