Algorithms for Mining Tuples of Data on huge sample space

Question

I read that the Apriori algorithm is used to extract association rules from a dataset such as a set of tuples. It helps us find the most frequent 1-itemsets, 2-itemsets, and so on. My problem is a bit different. I have a dataset that is a set of tuples of varying size, as follows:

(1, 234, 56, 32)
(25, 4575, 575, 464, 234, 32)
. . . (more tuples of different sizes)

The domain of the entries is huge, which means that I cannot keep a binary vector for each tuple that tells me whether item 'x' is present in the tuple. Hence, I do not see the Apriori algorithm fitting here.

My goal is to answer questions like:

Give me the ranked list of the 5 numbers that occur most often with 234
Give me the top 5 subsets of size 'k' that occur together most frequently

Requirements: exact representation of the numbers in the output (not approximate). The domain of the numbers can be thought of as 1 to 1 billion.

I plan to use simple counting methods if no standard algorithm fits here. But if you know of an algorithm that can help me, please let me know.
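For reference, the simple counting approach for the first kind of query could look roughly like the sketch below. It keeps a hash-based counter keyed by the actual item values, so memory grows with the number of distinct items seen alongside the query item, not with the size of the domain. The function name `top_cooccurring` and the sample data are made up for illustration.

```python
from collections import Counter

def top_cooccurring(transactions, item, n=5):
    """Return up to n (other_item, count) pairs that share a tuple with `item` most often."""
    counts = Counter()
    for t in transactions:
        items = set(t)              # count each item once per tuple
        if item in items:
            items.discard(item)     # do not count the query item itself
            counts.update(items)
    return counts.most_common(n)

# Hypothetical sample data in the same shape as the tuples above.
transactions = [
    (1, 234, 56, 32),
    (25, 4575, 575, 464, 234, 32),
    (234, 56, 900000000),
]
print(top_cooccurring(transactions, 234, n=2))
```

This makes one pass over the data per query. If many queries are expected, an inverted index from item to the tuples containing it would avoid rescanning everything.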

How huge is huge? I'm not very familiar with the Apriori algorithm, but I've read about its use on feature vectors of lengths ranging in the millions. A good sparse representation helps (which you already seem to have). – Fred Foo, Oct 09 '12 at 15:34
Do you want an exact or approximate representation of your data? That is, are you willing to accept small errors in the answers to your queries? Also, do you have any prior knowledge about how numbers will be associated (not about specific numbers, but about the structure of their distribution)? – Bitwise, Oct 09 '12 at 15:42
Also, note that your representation is equivalent to a hypergraph, so you might want to look at algorithms/implementations for those. – Bitwise, Oct 09 '12 at 15:43
The main problem is that the integers in the tuples can range from 1 to a huge number, say a billion. In this scenario, I cannot keep a vector (1,0,0,0,1,0,0,....), as it would be overkill. – Code4Fun, Oct 09 '12 at 15:48

score 2 · Answer 1 · answered Oct 09 '12 at 16:05

I have worked with data mining in Apriori. The question is, do you have ALL those items present? How many individual item IDs do you actually have? I understand the item IDs may range over a large domain, but perhaps they're not all present. In that case a sparse market basket representation may still be good for you, and you would be able to use Apriori. Setting your minimum support and confidence to high values will also eliminate a lot of low-priority links. I use the Orange library for my data mining requirements.
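To illustrate the point about sparsity and minimum support, here is a rough sketch in plain Python (this is not Orange's API; the function names and sample data are invented for the example). It counts the support of only those item IDs that actually occur and then drops everything below a threshold, which usually shrinks the data a lot before any itemset mining starts.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Map each item to its support, keeping only items seen in >= min_support tuples."""
    support = Counter()
    for t in transactions:
        support.update(set(t))      # each item counts once per tuple
    return {item: c for item, c in support.items() if c >= min_support}

def prune(transactions, keep):
    """Drop infrequent items; each tuple becomes a sorted tuple of surviving items."""
    return [tuple(sorted(i for i in set(t) if i in keep)) for t in transactions]

# Hypothetical sample data.
transactions = [
    (1, 234, 56, 32),
    (25, 4575, 575, 464, 234, 32),
    (234, 56, 900000000),
]
frequent = frequent_items(transactions, min_support=2)
pruned = prune(transactions, frequent)
print(len(frequent), pruned)
```

Only the IDs that are present ever get a counter entry, so a domain of 1 to 1 billion costs nothing by itself.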

Yaelgro

Exactly, not all the item IDs are present in the dataset. In fact, I want to use this dataset not only for analysis, but also for recommendations such as "predict 3 numbers that can be seen with input number 'X'". – Code4Fun Oct 09 '12 at 16:13
Try using the Orange [AssociationRulesSparseInducer](http://orange.biolab.si/doc/reference/Orange.associate/), which induces frequent itemsets and association rules from sparse data sets. It also has a pretty clear [association rule tutorial](http://orange.biolab.si/doc/ofb/assoc.htm). – Yaelgro Oct 09 '12 at 16:36

score 1 · Answer 2 · answered Oct 09 '12 at 19:50

For Apriori, you do not need to have tuples or vectors. It can be implemented with very different data types. The common data type is a sorted item list, which could as well look like 1 13 712 1928 123945 191823476 stored as 6 integers. This is essentially equivalent to a sparse binary vector and often very memory efficient. Plus, APRIORI is actually designed to run on data sets too large for your main memory!
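As a rough sketch of how this can look in practice (a simplified, in-memory variant written for illustration, not a tuned implementation; the function name and sample data are made up), each transaction is kept as a sorted tuple of integers and itemsets are counted level by level. Instead of the classic join-based candidate generation, this version enumerates candidates per transaction and applies the Apriori pruning rule to each one.

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_support, max_size):
    """Return {itemset (sorted tuple): support} for frequent itemsets up to max_size."""
    # Sorted item lists: one sorted tuple of distinct ints per transaction.
    rows = [tuple(sorted(set(t))) for t in transactions]

    # Level 1: support of single items.
    counts = Counter((i,) for row in rows for i in row)
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent and k <= max_size:
        # Items absent from every frequent (k-1)-itemset cannot be in a frequent k-itemset.
        alive = {i for s in frequent for i in s}
        counts = Counter()
        for row in rows:
            kept = [i for i in row if i in alive]
            for cand in combinations(kept, k):
                # Apriori pruning: every (k-1)-subset must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, k - 1)):
                    counts[cand] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical sample data.
transactions = [
    (1, 234, 56, 32),
    (25, 4575, 575, 464, 234, 32),
    (234, 56, 900000000),
]
itemsets = apriori(transactions, min_support=2, max_size=2)
pairs = sorted((s for s in itemsets if len(s) == 2), key=itemsets.get, reverse=True)
print(pairs[:5])
```

Filtering the result by itemset length and sorting by support, as in the last lines, gives the "top 5 subsets of size k" style of answer, limited to itemsets that meet the minimum support.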

The scalability of APRIORI depends on both the number of transactions and the number of items. Depending on how large each of these is, you might prefer different data structures and algorithms.