
I want to create a DataFrame from a list of strings that matches an existing schema. Here is my code.

    val rowValues = List("ann", "f", "90", "world", "23456") // fails
    val rowValueTuple = ("ann", "f", "90", "world", "23456") //works

    val newRow = sqlContext.sparkContext.parallelize(Seq(rowValueTuple)).toDF(df.columns: _*)

    val newdf = df.unionAll(newRow).show()

The same code fails if I use the List of Strings. I see the difference is that with rowValueTuple a Tuple is created. Since the size of the rowValues list changes dynamically, I cannot manually create a TupleN object. How can I do this? What am I missing? How can I flatten this list to meet the requirement?

I appreciate your help.

NehaM
  • The first gives you a DF with one column and 5 rows. The second gives you a DF with a single row with a single column that contains a tuple. Very different things. – The Archetypal Paul Apr 21 '16 at 12:25

1 Answer


A DataFrame has a schema with a fixed number of columns, so it doesn't seem natural to make a row per variable-length list. Anyway, you can create your DataFrame from an RDD[Row] using the existing schema, like this:

    import org.apache.spark.sql.Row

    val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
    val rowRdd = rdd.map(v => Row(v: _*)) // expand the list into a Row
    val newRow = sqlContext.createDataFrame(rowRdd, df.schema)
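
The reason `Row(v: _*)` works for a list of any length is that `Row.apply` takes varargs, and the `: _*` ascription expands the list into individual arguments. A minimal plain-Scala sketch of that mechanism (`makeRow` is a hypothetical stand-in for `Row.apply(values: Any*)`):

```scala
object SplatDemo {
  // hypothetical stand-in for Row.apply(values: Any*)
  def makeRow(values: Any*): Seq[Any] = values.toSeq

  def main(args: Array[String]): Unit = {
    val rowValues = List("ann", "f", "90", "world", "23456")
    // `: _*` expands the 5-element list into 5 separate arguments,
    // so the same call works regardless of the list's length
    val row = makeRow(rowValues: _*)
    println(row.length) // one entry per column
  }
}
```

This is why the tuple version also works with `toDF`: a tuple has a fixed arity known at compile time, whereas a list needs the `: _*` expansion at the call site.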
Vitalii Kotliarenko
  • Thanks @vitality. I did try this but missed something. I agree with your point, but I want to perform this for a given pair of DataFrame and list of row values as parameters. The number of columns of the DataFrame and the length of the row values are assumed to be the same. – NehaM Apr 21 '16 at 12:42
  • 3
    Just a note here, the last line should be ```val newRow = sqlContext.createDataFrame(rowRdd, df.schema)``` At least that's what worked for me. – Rylan Nov 22 '16 at 23:36
  • 4
    @Rylan: What is `df` here? – Dinosaurius May 12 '17 at 20:42
  • If pyspark instead of scala, then what would be the map line of code? – Geoffrey Anderson Jun 13 '18 at 19:08