So I have a variable `data` which is an RDD[Array[String]]. I want to iterate over it and compare adjacent elements. To do this I must create a Dataset from the RDD.

I try the following, where `sc` is my SparkContext:

import org.apache.spark.sql.SQLContext

val sqc = new SQLContext(sc)
val lines = sqc.createDataset(data)

And I get the two following errors:

Error:(12, 34) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases. val lines = sqc.createDataset(data)

Error:(12, 34) not enough arguments for method createDataset: (implicit evidence$4: org.apache.spark.sql.Encoder[Array[String]])org.apache.spark.sql.Dataset[Array[String]]. Unspecified value parameter evidence$4. val lines = sqc.createDataset(data)

Sure, I understand that I need to pass an Encoder argument, but what would it be in this case, and how do I import Encoders? When I try it myself, the compiler says that createDataset does not take that as an argument.

There are similar questions, but they do not explain how to use the encoder argument. If my RDD is an RDD[String] it works perfectly fine; however, in this case it is an RDD[Array[String]].
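
For reference, a minimal sketch of that contrast, using the same `sc` and `data` as above (the failing line reflects the compile errors quoted earlier, on Spark 1.6):

import sqc.implicits._

// Compiles: String is one of the primitive types covered by the implicits
val works = sqc.createDataset(sc.parallelize(Seq("a", "b")))

// Does not compile on Spark 1.6: no implicit Encoder[Array[String]] in scope
// val fails = sqc.createDataset(data)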

osk
  • `import sqc.implicits._` – philantrovert Dec 04 '17 at 08:29
  • I dont consider it duplicate since I have already read through those questions. – osk Dec 04 '17 at 08:31
  • So if I import it, how do I use the encoder? (What do I pass to the 2nd argument?) – osk Dec 04 '17 at 08:31
  • What makes your question different from the possible duplicate given by philantrovert? You shouldn't have to consider any encoder when the import is done and you do not need any second argument. – Shaido Dec 04 '17 at 08:42
  • Read my last edit of my post. It works for RDD[String] but not for RDD[Array[String]]. – osk Dec 04 '17 at 08:43
  • Once the implicits are in scope, Spark will automatically convert whatever it can. If you want to be more specific about it, you can use: `sqc.createDataset(rdd)(newStringArrayEncoder)` – philantrovert Dec 04 '17 at 08:44
  • What do I need to import to use newStringArrayEncoder? If I import org.apache.spark.sql.SQLImplicits it does not recognize it – osk Dec 04 '17 at 08:46
  • `import sqlContext.implicits._` for Spark versions < 2 and `spark.implicits._` for Spark 2+ – philantrovert Dec 04 '17 at 08:47
  • Sorry for all the questions, I am new to this. I have the following code: `import sqc.implicits._; val lines = sqc.createDataset(data)(newStringArrayEncoder)`. It still cannot resolve the symbol. I am using Spark 1.6. – osk Dec 04 '17 at 08:50
  • I'm watching where is this going before I close it as a dupe. – eliasah Dec 04 '17 at 09:25
  • So after `import sqc.implicits._` I try `newStringArrayEncoder`. However, this does not exist; sqc.implicits only has the standard datatypes and a Product type. – osk Dec 04 '17 at 09:34
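
To summarize the thread: on Spark 2.x the import alone is enough, because `spark.implicits._` includes `newStringArrayEncoder`; on Spark 1.6 that implicit does not exist. As a hedged sketch for 1.6, an encoder can instead be derived explicitly with Catalyst's ExpressionEncoder (whether Array[String] derivation works on 1.6 is an assumption here, not something the thread confirms):

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Derive an Encoder[Array[String]] by reflection and pass it explicitly,
// instead of relying on an implicit that Spark 1.6 does not provide.
val lines = sqc.createDataset(data)(ExpressionEncoder[Array[String]]())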

1 Answer

All of the comments on the question are trying to tell you the following.

You say you have an RDD[Array[String]], which I can create by doing the following:

val rdd = sc.parallelize(Seq(Array("a", "b"), Array("d", "e"), Array("g", "f"), Array("e", "r")))
// rdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at worksheetTest.sc4592:13

Now, converting the RDD to a DataFrame is just a matter of calling .toDF, but before that you need to import implicits._ from the SQLContext, as below:

val sqc = new SQLContext(sc)
import sqc.implicits._    // brings the implicit encoders and conversions that make .toDF available
rdd.toDF().show(false)    // a single "value" column holding each Array[String]

You should then have a DataFrame like this:

+------+
|value |
+------+
|[a, b]|
|[d, e]|
|[g, f]|
|[e, r]|
+------+
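
If you specifically want a Dataset[Array[String]] rather than a DataFrame, the same import should be enough on Spark 2.x, where the implicits include an Array[String] encoder. A minimal sketch under that assumption, reusing the names above:

import sqc.implicits._

// createDataset resolves its Encoder[Array[String]] from the implicits
val ds = sqc.createDataset(rdd)
ds.show(false)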

Isn't this all simple?

Ramesh Maharjan