Given a DataFrame:

+---+----------+
|key|     value|
+---+----------+
|foo|       bar|
|bar|  one, two|
+---+----------+

I'd like to use the value column as input to FPGrowth, which expects an RDD[Array[String]]:

val transactions: RDD[Array[String]] = df.select("value").rdd.map(x => x.getList(0).toArray.map(_.toString))

import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)

I get this exception:

  org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 141.0 failed 1 times, most recent failure: Lost task 7.0 in stage 141.0 (TID 2232, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to scala.collection.Seq

Any suggestions are welcome!

Toren
  • The problem is `df.select("value").rdd.map(x => x.getList(0)...`: the value held in `x` is a string, so it should be `x.getString(0)` – Raphael Roth Jan 29 '17 at 17:06
  • 1) `.getString(0)` turns the sequence into a sequence of Char: `org.apache.spark.SparkException: Items in a transaction must be unique but got WrappedArray(o, n, e, ,, t, w, o).` 2) The problem is the exception, and I cannot understand why it occurs. – Toren Jan 29 '17 at 17:15
  • Have you tried `val transactions = df.select("value").rdd.map(_.toString.split(","))`? It gives `transactions: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at :35`. Maybe this will work, as it returns an `RDD[Array[String]]`. – Rajat Mishra Jan 29 '17 at 19:20

1 Answer

Instead of

val transactions: RDD[Array[String]] = df.select("value").rdd.map(x => x.getList(0).toArray.map(_.toString))

try

val transactions = df.select("value").rdd.map(_.toString.stripPrefix("[").stripSuffix("]").split(","))

This gives the desired output, an RDD[Array[String]]:

scala> val transactions = df.select("value").rdd.map(_.toString.stripPrefix("[").stripSuffix("]").split(","))
transactions: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[10] at map at <console>:33
scala> transactions.take(2)
res21: Array[Array[String]] = Array(Array(bar), Array(one, two))

Row.toString wraps the row's values in "[" and "]", so strip them with stripPrefix and stripSuffix before calling split.
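
As an alternative sketch (not part of the original answer), reading the cell with getString(0) avoids the brackets that Row.toString adds. The helper below, whose name toItems is made up for illustration, also trims the space after each comma and drops duplicates, since FPGrowth rejects transactions with repeated items (see the "must be unique" exception in the comments on the question). It assumes the value column holds plain comma-separated strings.

```scala
// Hypothetical helper (name and behaviour are assumptions, not from the answer):
// turn one raw cell such as "one, two" into the items of a single transaction.
def toItems(value: String): Array[String] =
  Option(value)                          // a null cell becomes an empty transaction
    .map(_.split(",")                    // split on the comma separator
      .map(_.trim)                       // drop the space left after each comma
      .filter(_.nonEmpty)                // ignore empty fragments such as "a,,b"
      .distinct)                         // FPGrowth requires unique items per transaction
    .getOrElse(Array.empty[String])

// Usage with the DataFrame from the question (requires Spark and `df` in scope):
// val transactions: RDD[Array[String]] =
//   df.select("value").rdd.map(row => toItems(row.getString(0)))
```

The parsing is kept in a plain function so it can be checked without a Spark session. Whether duplicates should be dropped or treated as a data error depends on your data.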

Rajat Mishra
  • I also thought that I was missing `split`. After splitting as you suggested, I get the brackets along with the strings themselves: `[one` or `two]` – Toren Jan 30 '17 at 10:17
  • @Toren To get the data without "[" and "]", we can use the stripPrefix and stripSuffix functions. I have updated the code. – Rajat Mishra Jan 30 '17 at 10:56