1

I am working with big datasets in MySQL (combined with Java) and trying to implement a Frequent Itemset algorithm. A recurring aspect of the algorithm is counting how many times a set of items (an item is a random integer) occurs in the dataset.

Take for example this small dataset T:

ID | COL1 | COL2 | COL3 | COL4 | COL5 |
---------------------------------------
1  |   8  |  35  |  42  |  12  |  27  |
2  |  22  |  42  |  35  |  8   | NULL |
3  |  18  |  22  |   8  | NULL | NULL |
4  |  42  |  12  |  27  |  35  |  8   |
5  |  18  |  27  |  12  |  22  | NULL |

And this table T2:

COL1 | COL2 |
-------------
35   | 27   |
22   | 8    |
42   | 8    |
18   | 35   |
35   | 42   |

What I want as result is the following table (it can be an answer to a query as well):

COL1 | COL2 | COUNT |
---------------------
35   | 27   | 2     |
22   | 8    | 2     |
42   | 8    | 3     |
18   | 35   | 0     |
35   | 42   | 3     |

So I want to count every occurrence of each row of table T2 in table T. Basically how many times is a row of T2 a subset of rows in T

This has to be done in every generation of the algorithm. This is a very small example, eventually the same has to be done with T3 (rows with 3 items), T4 (rows with 4 items), etc. Table T stays the same. I also have to take into account that the order doesn't matter ( |35, 27| = |27, 35|) and that they will probably not be in columns next to each other in T

Is it possible to do this without going over the dataset too many times (whereas too many = more than the amount of rows from T2)? Might it be better to represent a row as a tuple (e.g. (35, 27)) so it becomes one item?

  • "whereas too many = more than the amount of rows from T2" doesn't look like it what i can think of is generating a cvs and use FIND_IN_SET to find them but you can't use indexes if anny so this will not scale on large dataset's or use UNION ALL on the T table to convert the columns into records and JOIN with the columns off the T2 table but that will only scale on large datasets if the columns col1 and col2 both are indexed.problem with this approach is that UNION ALL will use a memory table to hold the results. – Raymond Nijland Mar 21 '18 at 16:54
  • Representing the row as tuple would just be compounding your problem; the issue is that T should be two tables `T` (with set_id), and `T values` (with set_id, value, and perhaps "sequence" if that matters.)... T2-T4 should be similarly restructured. – Uueerdo Mar 21 '18 at 17:20

2 Answers2

1

If you can restructure your data to one value per row, for T and TN, something like this should work for all TN at once.

SELECT n_id, COUNT(CASE WHEN matches = n_count THEN v_id ELSE NULL) AS occurences
FROM (
  SELECT n.n_id, v.set_id AS v_id, n.n_count, COUNT(*) AS matches
  FROM (SELECT n_id, COUNT(*) AS n_count FROM tN GROUP BY id) AS n
  INNER JOIN tN AS nv ON n.n_id = nv.n_id
  LEFT JOIN T_VALUES AS v ON nv.value = v.value
  GROUP BY n.n_id, v.set_id, n.n_count
) AS subQ;

If you need the TN values in your final results, something like this would come close.

SELECT n_id, n_values, COUNT(CASE WHEN matches = n_count THEN v_id ELSE NULL) AS occurences
FROM (
  SELECT n.n_id, n.n_count, n.n_values, v.set_id AS v_id, COUNT(*) AS matches
  FROM (
       SELECT n_id, COUNT(*) AS n_count 
          , GROUP_CONCAT(n.value) AS n_values
       FROM tN 
       GROUP BY id
  ) AS n
  INNER JOIN tN AS nv ON n.n_id = nv.n_id
  LEFT JOIN T_VALUES AS v ON nv.value = v.value
  GROUP BY n.n_id, n.n_count, n.n_values, v.set_id
) AS subQ;

Note: you can probably get away without a subquery, but could end up having the database calculate the same n_count and n_values repeatedly for each row of T.

Uueerdo
  • 15,723
  • 1
  • 16
  • 21
  • Thank you, I have not actually used this particular solution, but it showed me I had to restructure my dataset. There's an explanation below how. – DataWarrior Niño Mar 28 '18 at 15:06
0

Because of Uueerdo's answer I realised I had to use a different structure, so instead of using the table T:

ID | COL1 | COL2 | COL3 | COL4 | COL5 |
---------------------------------------
1  |   8  |  35  |  42  |  12  |  27  |
2  |  22  |  42  |  35  |  8   | NULL |
3  |  18  |  22  |   8  | NULL | NULL |
4  |  42  |  12  |  27  |  35  |  8   |
5  |  18  |  27  |  12  |  22  | NULL |

I now use Tnew:

ID | Item|
1  | 8   |
1  | 35  |
1  | 42  |
.  |  .  |
.  |  .  |
.  |  .  |

This works way easier in SQL, you can use Group By and Join to get the result needed. The query works with any number of items with the same ID. Also, you don't have to use the value NULL and the dataset is easier to create

If anyone wants to know the query I eventually used, please let me know (bit of work to come up with good tablenames and make it clear and understandable).