
I am using the `arules` package to mine frequent itemsets in a large dataset, but I cannot find a suitable method for discretization.

As the examples in the `arules` package show, the `discretize` function supports several basic unsupervised methods. However, I would like to estimate the optimal number of categories for my large dataset, which seems more reasonable than assigning the number of categories by hand.

Can you give me some advice on this? Thanks.

@Michael Hahsler

Pan

1 Answer


I think there is little guidance on this for unsupervised discretization. Look at the histogram of each variable and decide manually. For k-means, you could potentially find k using internal validation techniques (e.g., the elbow method). For supervised discretization, there are methods that will help you decide. Maybe someone else can help here.
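As an illustration of the elbow idea mentioned above, here is a minimal sketch in base R. It is not part of `arules`: it assumes a single numeric column, uses synthetic data as a stand-in (`x`, `ks`, and `wss` are names made up for this example), and relies only on `stats::kmeans`.

```r
set.seed(42)

# Synthetic stand-in for one numeric column: three well-separated groups.
x <- c(rnorm(300, mean = 0), rnorm(300, mean = 10), rnorm(300, mean = 20))

# Total within-cluster sum of squares for each candidate k.
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Inspect how quickly the within-cluster sum of squares drops as k grows.
print(data.frame(k = ks, wss = round(wss)))
```

The value of k after which `wss` stops dropping sharply is a candidate number of categories. Plotting `wss` against `ks` (for example with `plot(ks, wss, type = "b")`) usually makes the bend easier to see, but the choice remains a judgment call.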

Michael Hahsler
  • Thanks for the reply. My data is too big, so when I use k-means I get the warning message: Quick-TRANSfer stage steps exceeded maximum (= 93441300) – Pan Jan 31 '18 at 18:06
  • Take a sample, apply k-means discretization with `onlycuts=TRUE`, and then use the `fixed` method with the returned cuts on all the data. – Michael Hahsler Feb 01 '18 at 18:55
  • Thanks for the reply. With your method, I still have to estimate the optimal number of categories myself, am I right? – Pan Feb 01 '18 at 21:36
  • Yes, you have to specify the number. – Michael Hahsler Feb 02 '18 at 21:55
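
The sample-then-apply workflow from the comments can be sketched roughly as follows. To keep the example self-contained it uses base R (`kmeans` and `cut`) rather than `arules`, and places the cut points halfway between adjacent cluster centers, which is an assumption about how cluster-based cuts are typically derived. The data, the sample size, and `k <- 3` are all made up for illustration. In `arules` itself, the corresponding steps would be `discretize()` with the cluster method and `onlycuts = TRUE` on the sample, followed by the `fixed` method on the full data.

```r
set.seed(1)

# Synthetic stand-in for one large numeric column.
x <- c(rnorm(50000, mean = 0), rnorm(50000, mean = 10), rnorm(50000, mean = 20))
k <- 3  # number of categories; still has to be chosen by the user

# 1. Run k-means on a random sample only, which keeps clustering cheap.
s       <- sample(x, 5000)
centers <- sort(kmeans(s, centers = k, nstart = 10)$centers[, 1])

# 2. Derive cut points as midpoints between adjacent centers.
#    The outer bounds are left open so that values outside the
#    sample's range do not become NA on the full data.
cuts <- c(-Inf, head(centers, -1) + diff(centers) / 2, Inf)

# 3. Apply the fixed cut points to all the data.
x_disc <- cut(x, breaks = cuts, right = FALSE)
print(table(x_disc))
```

Because the cuts are estimated from a sample, it is worth checking that the resulting category counts on the full data look plausible before mining itemsets.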