
I have data in a key-value pairing, where the key is the column index and the value is whatever is in that column. My original file is just a CSV. So I have the following:

val myData = sc.textFile(file1)
  .map(x => x.split('|'))
  .flatMap(x => x.zipWithIndex)
  .map(x => x.swap)
  .groupByKey().cache

This puts my data into myData: RDD[(Int, Iterable[String])]
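The same pipeline can be sketched on plain Scala collections (standing in for the RDD) to show how rows get transposed into (columnIndex, columnValues) pairs; the sample lines here are hypothetical:

```scala
// Plain-collection stand-in for the Spark pipeline above.
val lines = Seq("a|b|c", "d|e|f")

val columns = lines
  .map(_.split('|'))                // Array(a, b, c), Array(d, e, f)
  .flatMap(_.zipWithIndex)          // (a,0), (b,1), (c,2), (d,0), ...
  .map(_.swap)                      // (0,a), (1,b), (2,c), (0,d), ...
  .groupBy(_._1)                    // group by column index, like groupByKey
  .map { case (i, pairs) => i -> pairs.map(_._2) }

// columns: Map[Int, Seq[String]] -- each key is a column index,
// each value holds everything that appeared in that column
```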

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(1)

val model = fpg.run(myData)

I get the following issues:

<console>:29: error: inferred type arguments [Nothing,(Int, Iterable[String])] do not conform to method run's type parameter bounds [Item,Basket <: Iterable[Item]]

I am trying to learn how to use MLlib and don't quite understand the issue. I've also tried removing the index with .map(x => x._2) and making sets of just the iterable data, but that also fails.

zero323
theMadKing

1 Answer


This should solve your problem:

fpg.run(myData.values.map(_.toArray))

Basically, FPGrowth requires an RDD of Arrays of items. Passing the output of groupByKey directly won't work because it contains Tuple2s, and the output of map(x => x._2) won't work because the value is an Iterable, not an Array.

Each element of the RDD represents a single basket and should contain only unique items. If you expect duplicates you can use _.toSet.toArray or _.distinct.toArray.
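The deduplication step can be sketched on plain collections (standing in for myData.values); the sample baskets are hypothetical, and .sorted is only there to make the result deterministic:

```scala
// _.toSet.toArray drops repeats inside each basket
// while keeping every basket in the collection.
val baskets = Seq(Seq("T", "T", "A"), Seq("T", "B"))

val deduped = baskets.map(_.toSet.toArray.sorted)
// each basket now contains only unique items: Array(A, T), Array(B, T)
```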

zero323
  • Thanks, also I am getting this org.apache.spark.SparkException (Items in a transaction must be unique but got WrappedArray(T, T, T, T, ..) [duplicate 2]. Is it because the array contains all T's and nothing else? – theMadKing Nov 03 '15 at 23:36
  • No, it is because you have duplicate entries. Input like this doesn't make sense for `FPGrowth`. – zero323 Nov 03 '15 at 23:38
  • In a basket yes. `fpg.run(myData.values.map(_.toSet.toArray))` – zero323 Nov 03 '15 at 23:41
  • So something like this would be more appropriate? val model = fpg.run(myData.values.distinct.map(_.toArray)) – theMadKing Nov 03 '15 at 23:42
  • No. Each entry in an RDD you pass to `FPGrowth` is a single basket. You don't want duplicates in a basket because every basket which contains item `T` trivially has to contain `T`. There is no useful rule there :) You do want duplicate baskets, because that is useful information. This is one of the things that makes itemsets frequent. – zero323 Nov 03 '15 at 23:45