Can you please help me? I have a dataset of 80 CSV files and a cluster with one master and four slaves. I want to read the CSV files into a DataFrame and parallelize it across the four slaves, then filter the DataFrame with a group by. In my Spark queries, the result contains the columns "code_ccam" and "dossier", grouped by ("code_ccam", "dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" that are repeated per "dossier" (folder). But when I call FPGrowth.fit(), I get the following error:

"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"

Here are my Spark commands:

val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)
  • Remove `.rdd.map(x => x)`. – 10465355 Feb 20 '19 at 09:31
  • Thank you. The above error is gone. But when I fit the model to the data, I get the error: `"org.apache.spark.SparkException: Items in a transaction must be unique but got WrappedArray(QZQX004, QZFA020,"` I limited the data to the first 700000 rows to avoid an OOM. Here is my code for limiting the data and fitting the model: `val df4=df3.select("dossier","code_ccam").limit(700000).groupBy("dossier","code_ccam").count()` – Malik Berrada Feb 20 '19 at 10:11
  • `val transactions4 = df4.agg(collect_list("code_ccam").alias("codes_ccam")) val model = fpgrowth.fit(transactions4)` – Malik Berrada Feb 20 '19 at 10:18
  • Also, `df4.agg(collect_list(...))`: shouldn't you have some grouping there? And `collect_list` -> `collect_set`. – 10465355 Feb 20 '19 at 11:07
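The two suggestions in the last comment (group before aggregating, and use `collect_set` rather than `collect_list`) can be illustrated with a minimal sketch. The toy rows below are hypothetical and only reuse the two codes visible in the error message. The sketch assumes a local session; in spark-shell, `getOrCreate()` simply returns the existing `spark`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("collect-set-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy rows: dossier d1 contains the code QZQX004 twice.
val df = Seq(
  ("d1", "QZQX004"), ("d1", "QZFA020"), ("d1", "QZQX004"),
  ("d2", "QZFA020")
).toDF("dossier", "code_ccam")

// Grouping by dossier yields one array per dossier (one transaction each).
// collect_list keeps the repeated code, which FPGrowth rejects as non-unique items.
val asList = df.groupBy("dossier").agg(collect_list("code_ccam").alias("codes_ccam"))

// collect_set drops the repeat, so every transaction holds distinct items.
val asSet = df.groupBy("dossier").agg(collect_set("code_ccam").alias("codes_ccam"))

asList.show(false)
asSet.show(false)
```

Without the `groupBy("dossier")`, `agg` would collapse the whole DataFrame into a single row, that is, a single transaction.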

1 Answer

Thank you very much. It worked. I replaced collect_list with collect_set.
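For reference, here is a hedged sketch of what the working pipeline might look like once the comments are applied. It is not the asker's exact code. The toy rows are hypothetical stand-ins for the (dossier, code_ccam) columns of df3, and the sketch assumes one transaction per dossier. Note that the name passed to `setItemsCol` has to match the alias of the aggregated array column (`codes_ccam` here).

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

// Assumption: a local session; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("fpgrowth-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data standing in for the (dossier, code_ccam) rows of df3.
val rows = Seq(
  ("d1", "QZQX004"), ("d1", "QZFA020"), ("d1", "QZQX004"),
  ("d2", "QZQX004"), ("d2", "QZFA020"),
  ("d3", "QZQX004")
).toDF("dossier", "code_ccam")

// One transaction per dossier; collect_set guarantees unique items.
val transactions = rows
  .groupBy("dossier")
  .agg(collect_set("code_ccam").alias("codes_ccam"))

// The ml FPGrowth estimator takes a DataFrame/Dataset, so no `.rdd` conversion.
val fpgrowth = new FPGrowth()
  .setItemsCol("codes_ccam")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)

model.freqItemsets.show(false)      // frequent sets of codes with their counts
model.associationRules.show(false)  // rules that pass the confidence threshold
```

On real data, a minimum support of 0.5 means an itemset must appear in at least half of all dossiers, so the threshold may need to be lowered before any itemsets show up.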