val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

I have two arrays as above, and I need to build a DataFrame from these arrays like the following:

Tvalues                Pvalues
1.866393526974307      0.064020056478447
2.864048126935307      0.004808399479386827
......                 .....

As of now I am trying with a StringBuilder in Scala, which doesn't go as expected. Please help me with this.


1 Answer


Try for instance

val df = sc.parallelize(tvalues zip pvalues).toDF("Tvalues","Pvalues")

and thus

scala> df.show
+------------------+--------------------+
|           Tvalues|             Pvalues|
+------------------+--------------------+
| 1.866393526974307|   0.064020056478447|
| 2.864048126935307|0.004808399479386827|
| 4.032486069215076|8.914865448939047E-5|
| 7.876169953355888|7.489564524121306...|
| 4.875333799256043|2.836379410675604...|
|14.316322626848278|                 0.0|
+------------------+--------------------+

Using zip we pair the two arrays element-wise into an array of tuples (the first element from the first array, the second from the other); parallelize turns that array into an RDD of tuples, and toDF transforms it into a DataFrame with one row per tuple.
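
For completeness, here is a minimal self-contained sketch of the same idea for Spark 2.x, where the implicits come from a `SparkSession` (in `spark-shell` the session and implicits already exist; the app name and master here are placeholder assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ArraysToDF").master("local[*]").getOrCreate()
import spark.implicits._  // brings toDF into scope

val tvalues = Array(1.866393526974307, 2.864048126935307, 4.032486069215076)
val pvalues = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5)

// zip pairs the arrays element-wise: Array[(Double, Double)]
val df = spark.sparkContext.parallelize(tvalues zip pvalues).toDF("Tvalues", "Pvalues")
df.show()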

Update

For turning multiple arrays (all of the same length) into a DataFrame, for instance 4 arrays, consider

case class Row(i: Double, j: Double, k: Double, m: Double)

// transpose turns the 4 arrays of length n into n arrays of 4 values each
val xs = Array(arr1, arr2, arr3, arr4).transpose
val rdd = sc.parallelize(xs).map(ys => Row(ys(0), ys(1), ys(2), ys(3)))
val df = rdd.toDF("i","j","k","m")
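
Here is a self-contained variant of the same transpose approach. The case class is renamed (a hypothetical name) because `Row` shadows Spark's own `org.apache.spark.sql.Row`, and in a compiled application it must be defined outside the method that calls `toDF`:

// Top-level case class, so toDF can find its TypeTag in a compiled app.
case class Quad(i: Double, j: Double, k: Double, m: Double)

val arr1 = Array(1.0, 2.0)
val arr2 = Array(3.0, 4.0)
val arr3 = Array(5.0, 6.0)
val arr4 = Array(7.0, 8.0)

// transpose: 4 arrays of length 2 become 2 arrays of length 4
val xs = Array(arr1, arr2, arr3, arr4).transpose
val df = sc.parallelize(xs).map(ys => Quad(ys(0), ys(1), ys(2), ys(3))).toDF("i", "j", "k", "m")
df.show()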
  • hi elm, suppose I have four arrays like this, how can I do that? – Sam May 11 '16 at 07:23
  • `val xs = Array(a1,a2,a3,a4).transpose` and then for each nested array construct the case class, parallelize the case classes and then toDF(...). – elm May 11 '16 at 08:44
  • Sorry @elm, I am not getting it, can you provide a sample for it? Forgive me, I am new to Spark/Scala. `val xs = Array(a1,a2,a3,a4).transpose` `sc.parallelize(xs(0) zip xs(1), xs(2), xs(3)).toDF("a","b","c","d")` is the code I tried. – Sam May 11 '16 at 09:02
  • hi @elm, I got an error while running this code as a Spark application via spark-submit, but it works fine in spark-shell. I don't understand this behaviour. – Sam May 12 '16 at 10:15
  • I am getting an error like toDF() not found; by googling I found that we need to define the case class outside the main function, and I did the same and it worked, but when I try to save the DataFrame into Hive I get an ArrayIndexOutOfBoundsException – Sam May 12 '16 at 11:02
  • `import org.apache.spark.sql._; import org.apache.spark.sql.SQLContext; val sqlCtx = new SQLContext(sc); import sqlCtx.implicits._` – elm May 12 '16 at 11:22
  • http://stackoverflow.com/questions/37185541/spark-scala-error-while-saving-dataframe-to-hive I have posted a question with my code and the error I am getting, kindly have a look at it – Sam May 12 '16 at 12:07
  • Hi elm, when one of the arrays is a string array, this code throws an error; what do you do in that case? – Leothorn Mar 26 '18 at 10:22
  • Just don't forget `import sparkSession.implicits._`. @Leothorn Make sure your array is not of type `Any` – belka Apr 08 '19 at 11:54
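
To tie the comment thread together: when the code runs under spark-submit instead of spark-shell, you must create the SQLContext and import its implicits yourself, and the case class must live outside main. A minimal Spark 1.x sketch along those lines (object and class names are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Defining this inside main would break toDF's implicit TypeTag lookup.
case class Stats(tvalue: Double, pvalue: Double)

object ArraysToDF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ArraysToDF"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.implicits._  // brings toDF into scope

    val tvalues = Array(1.866393526974307, 2.864048126935307)
    val pvalues = Array(0.064020056478447, 0.004808399479386827)

    val df = sc.parallelize(tvalues zip pvalues)
      .map { case (t, p) => Stats(t, p) }
      .toDF()
    df.show()
  }
}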