I am working with large datasets in MySQL (combined with Java) and trying to implement a frequent itemset algorithm. A recurring step of the algorithm is counting how many times a set of items (each item is an arbitrary integer) occurs in the dataset.
Take for example this small dataset T:
ID | COL1 | COL2 | COL3 | COL4 | COL5 |
---------------------------------------
1 | 8 | 35 | 42 | 12 | 27 |
2 | 22 | 42 | 35 | 8 | NULL |
3 | 18 | 22 | 8 | NULL | NULL |
4 | 42 | 12 | 27 | 35 | 8 |
5 | 18 | 27 | 12 | 22 | NULL |
And this table T2:
COL1 | COL2 |
-------------
35 | 27 |
22 | 8 |
42 | 8 |
18 | 35 |
35 | 42 |
What I want as a result is the following table (a query result would work just as well):
COL1 | COL2 | COUNT |
---------------------
35 | 27 | 2 |
22 | 8 | 2 |
42 | 8 | 3 |
18 | 35 | 0 |
35 | 42 | 3 |
So for each row of T2, I want to count how many rows of T contain it. In other words: for how many rows of T is the T2 row a subset of that row's items?
This has to be done in every generation of the algorithm. This is a very small example; eventually the same has to be done with T3 (rows with 3 items), T4 (rows with 4 items), etc. Table T stays the same. I also have to take into account that order doesn't matter (|35, 27| = |27, 35|) and that the items will probably not be in adjacent columns in T.
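To make the intended semantics concrete, here is a minimal in-memory Java sketch of the count I am after. It is only an illustration, not a proposed solution for the real data size: it assumes T and the candidate table fit in memory, and the names `SubsetCounter`, `countSupport`, `sampleT` and `sampleT2` are made up for this example. Each row of T is turned into a set (NULL cells are skipped), so column position and item order do not matter, and T is scanned only once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SubsetCounter {

    // Returns one count per candidate: the number of rows of T
    // that contain every item of that candidate.
    static int[] countSupport(List<Integer[]> rowsOfT, List<int[]> candidates) {
        int[] counts = new int[candidates.size()];
        for (Integer[] row : rowsOfT) {              // single pass over T
            Set<Integer> items = new HashSet<>();
            for (Integer item : row) {
                if (item != null) {                  // NULL cells are ignored
                    items.add(item);
                }
            }
            for (int c = 0; c < candidates.size(); c++) {
                boolean contained = true;
                for (int item : candidates.get(c)) {
                    if (!items.contains(item)) {
                        contained = false;
                        break;
                    }
                }
                if (contained) {
                    counts[c]++;
                }
            }
        }
        return counts;
    }

    // The example table T from above (ID column omitted).
    static List<Integer[]> sampleT() {
        return Arrays.asList(
            new Integer[]{8, 35, 42, 12, 27},
            new Integer[]{22, 42, 35, 8, null},
            new Integer[]{18, 22, 8, null, null},
            new Integer[]{42, 12, 27, 35, 8},
            new Integer[]{18, 27, 12, 22, null});
    }

    // The example table T2 from above.
    static List<int[]> sampleT2() {
        return Arrays.asList(
            new int[]{35, 27},
            new int[]{22, 8},
            new int[]{42, 8},
            new int[]{18, 35},
            new int[]{35, 42});
    }

    public static void main(String[] args) {
        int[] counts = countSupport(sampleT(), sampleT2());
        System.out.println(Arrays.toString(counts)); // prints [2, 2, 3, 0, 3]
    }
}
```

The same method works unchanged for T3, T4, etc., since a candidate is just an `int[]` of any length. What I do not know is how to get the equivalent result efficiently out of MySQL.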
Is it possible to do this without going over the dataset too many times (where "too many" means more passes than there are rows in T2)? Might it be better to represent a row as a tuple (e.g. (35, 27)) so it becomes one item?