
I have data in a key-value pairing, where the key is the column index and the value is whatever is in that column. My original file is just a CSV. So I have the following:

val myData = sc.textFile(file1)
  .map(x => x.split('|'))
  .flatMap(x => x.zipWithIndex)
  .map(x => x.swap)
  .groupByKey().cache

This puts my data into myData: RDD[(Int, Iterable[String])]
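The same pipeline can be sketched on plain Scala collections (standing in for the RDD) to show how rows get transposed into (columnIndex, columnValues) pairs; the sample lines here are hypothetical:

```scala
// Plain-collection stand-in for the Spark pipeline above.
val lines = Seq("a|b|c", "d|e|f")

val columns = lines
  .map(_.split('|'))                // Array(a, b, c), Array(d, e, f)
  .flatMap(_.zipWithIndex)          // (a,0), (b,1), (c,2), (d,0), ...
  .map(_.swap)                      // (0,a), (1,b), (2,c), (0,d), ...
  .groupBy(_._1)                    // group by column index, like groupByKey
  .map { case (i, pairs) => i -> pairs.map(_._2) }

// columns: Map[Int, Seq[String]] -- each key is a column index,
// each value holds everything that appeared in that column
```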

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(1)

val model = fpg.run(myData)

I get the following issues:

<console>:29: error: inferred type arguments [Nothing,(Int, Iterable[String])] do not conform to method run's type parameter bounds [Item,Basket <: Iterable[Item]]

I am trying to learn how to use MLlib and don't quite understand the issue. I've also tried removing the index with .map(x => x._2) and making sets of just the iterable data, but that also fails.

zero323
theMadKing

1 Answer


This should solve your problem:

fpg.run(myData.values.map(_.toArray))

Basically, FPGrowth requires an RDD of Arrays of items. Passing the output of groupByKey directly won't work because it contains Tuple2s, and the output of map(x => x._2) won't work because the value is an Iterable, not an Array.

Each element of the RDD represents a single basket and should contain only unique items. If you expect duplicates you can use _.toSet.toArray or _.distinct.toArray.
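The deduplication step can be sketched on plain collections (standing in for myData.values); the sample baskets are hypothetical, and .sorted is only there to make the result deterministic:

```scala
// _.toSet.toArray drops repeats inside each basket
// while keeping every basket in the collection.
val baskets = Seq(Seq("T", "T", "A"), Seq("T", "B"))

val deduped = baskets.map(_.toSet.toArray.sorted)
// each basket now contains only unique items: Array(A, T), Array(B, T)
```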

zero323
  • Thanks, also I am getting this org.apache.spark.SparkException (Items in a transaction must be unique but got WrappedArray(T, T, T, T, ..) [duplicate 2]. Is it because the array contains all T's and nothing else? – theMadKing Nov 03 '15 at 23:36
  • No, it is because you have duplicate entries. Input like this doesn't make sense for `FPGrowth`. – zero323 Nov 03 '15 at 23:38
  • In a basket yes. `fpg.run(myData.values.map(_.toSet.toArray))` – zero323 Nov 03 '15 at 23:41
  • So something like this would be more appropriate? val model = fpg.run(myData.values.distinct.map(_.toArray)) – theMadKing Nov 03 '15 at 23:42
  • No. Each entry in an RDD you pass to `FPGrowth` is a single basket. You don't want duplicates in a basket because every basket which contains item `T` trivially has to contain `T`. There is no useful rule there :) You do want duplicate baskets, because that is useful information. This is one of the things that makes itemsets frequent. – zero323 Nov 03 '15 at 23:45