I'm trying to run FPGrowth, but I'm stumbling over a problem with the input types. Given this code:

%scala
// association rule learning for OFFLINE with FPGrowth from MLLib
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.api.java.function.FlatMapFunction
import org.apache.spark.mllib.linalg.Vectors

val dfoffline = spark.table("offlinetrx")
val products = dfoffline
  .groupBy("Beleg")
  .agg(collect_set("Produkt") as "items")

// debugging
val columnProducts = products.select("items")
columnProducts.printSchema()
columnProducts.show()

This produces the following output:

root
|-- items: array (nullable = true)
|    |-- element: string (containsNull = true)

+--------------------+
|               items|
+--------------------+
|[19420.01, 46872.01]|
|[AEC003.01, AEC00...|
|  [BT102.01, BET103]|

The code continues with the transformation to an RDD and runs FPGrowth:

val rdd = columnProducts.rdd
val fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(6)
val model = fpg.run(rdd)

Then Spark tells me:

error: inferred type arguments [Nothing,org.apache.spark.sql.Row] do
not conform to method run's type parameter bounds [Item,Basket <:
Iterable[Item]]

val model = fpg.run(rdd)
notebook:74: error: type mismatch;
found   : org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row]
required: org.apache.spark.api.java.JavaRDD[Basket]
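
As far as I understand the MLlib API, FPGrowth.run expects an RDD of item collections rather than Rows. A minimal sketch of the input shape I believe it wants, using two baskets from the output above:

// run() wants each transaction as a collection of items, e.g. RDD[Array[String]]
val exampleTransactions: RDD[Array[String]] = spark.sparkContext.parallelize(Seq(
  Array("19420.01", "46872.01"),
  Array("BT102.01", "BET103")
))
val exampleModel = new FPGrowth().setMinSupport(0.2).run(exampleTransactions)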

Then I tried to map the DataFrame:

val rdd = columnProducts.map{x:Row => x.getAs[List](0)}

But this results in another error:

error: kinds of the type arguments (List) do not conform to the
expected kinds of the type parameters (type T).
List's type parameters do not match type T's expected parameters:
type List has one type parameter, but type T has none
val rdd = columnProducts.map{x:Row => x.getAs[List](0)}

How do I specify the type parameter (T) for the getAs[List] call?

Or does anyone have another good idea for solving the problem of FPGrowth requiring an RDD of baskets while I have an RDD of Rows?
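
For what it's worth, this is the direction I suspect the conversion has to take; it is untested, and the Seq[String] element type is just my guess based on the array<string> schema shown above:

// untested sketch: give getAs a concrete element type and build an
// RDD[Array[String]] that FPGrowth.run should accept
val transactions: RDD[Array[String]] = columnProducts.rdd
  .map(row => row.getAs[Seq[String]]("items").toArray)

val fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(6)
val model = fpg.run(transactions)

// inspect the frequent itemsets
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " : " + itemset.freq)
}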

Thanks, guys.

Marco P.
    You can use FPGrowth from [ml package](https://spark.apache.org/docs/2.3.0/ml-frequent-pattern-mining.html). Just change import to `import org.apache.spark.ml.fpm.FPGrowth` and feed columnProducts to model. – prudenko Sep 07 '18 at 12:52
  • great, thank you @prudenko – Marco P. Sep 07 '18 at 22:12
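
For future readers, a minimal sketch of the DataFrame-based approach suggested in the comment above (assuming Spark 2.2+; the minConfidence value is picked arbitrarily):

import org.apache.spark.ml.fpm.FPGrowth

// the ml-package FPGrowth works directly on the array column, no RDD conversion needed
val fpgrowth = new FPGrowth()
  .setItemsCol("items")     // the collect_set column built above
  .setMinSupport(0.2)
  .setMinConfidence(0.6)    // only relevant for the association rules

val mlModel = fpgrowth.fit(columnProducts)

mlModel.freqItemsets.show()      // frequent itemsets as a DataFrame
mlModel.associationRules.show()  // antecedent -> consequent rules with confidence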
