
I would like to convert two lists into a PySpark DataFrame, with each list as its own column.

I tried

a=[1, 2, 3, 4]
b=[2, 3, 4, 5]
sqlContext.createDataFrame([a, b], schema=['a', 'b']).show()

But I got

+---+---+---+---+                                                               
|  a|  b| _3| _4|
+---+---+---+---+
|  1|  2|  3|  4|
|  2|  3|  4|  5|
+---+---+---+---+

What I really want is this:

+---+---+                                                              
|  a|  b|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
|  4|  5|
+---+---+

Is there a convenient way to create this result?

statsNoob
  • check this out: https://stackoverflow.com/questions/48448473/pyspark-convert-a-standard-list-to-data-frame?noredirect=1&lq=1 – Itachi Oct 12 '18 at 17:14

3 Answers


Just transpose the lists:

sqlContext.createDataFrame(zip(a, b), schema=['a', 'b']).show()
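In Python 3, zip returns an iterator rather than a list; Spark generally accepts that, but if your version complains, materialising the pairs first is a safe variant (a minimal sketch of the same idea):

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
# list(...) turns the zip iterator into a concrete list of (a, b) pairs
sqlContext.createDataFrame(list(zip(a, b)), schema=['a', 'b']).show()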
fafl

I don't know about pyspark directly, but I would guess instead of this data structure:

[[1, 2, 3, 4],
 [2, 3, 4, 5]]

you need to give it this:

[[1, 2],
 [2, 3],
 [3, 4],
 [4, 5]]

One way to go from your data structure to the required one is to use numpy to transpose:

import numpy as np
a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
sqlContext.createDataFrame(np.array([a, b]).T, schema=['a', 'b']).show()
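If your Spark version cannot infer a schema from numpy types (whether it can is version-dependent, so treat this as an assumption to verify), converting the transposed array back to plain Python lists is a simple workaround:

import numpy as np

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
# .tolist() converts the numpy scalars back into plain Python ints
sqlContext.createDataFrame(np.array([a, b]).T.tolist(), schema=['a', 'b']).show()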
Dan
  • You'll probably want to be doing this if your data is going to be of any substantial size. And you'd likely want to go further and not use lists in the first place, instead having your data live in numpy arrays from the start. – PMende Oct 12 '18 at 17:21
  • 1
    to add to that, if your data aren't of substantial size, then why use spark at all? – Dan Oct 12 '18 at 17:23

Below are the steps to create a PySpark DataFrame.

Create a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

Create the data and column names:

columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

First approach: create the DataFrame from an RDD:

rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
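As a side note, toDF(*columns) renames the columns after the DataFrame is built; the names could also be passed up front through the schema argument (a minimal variant using the same rdd):

df = spark.createDataFrame(rdd, schema=columns)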

Second approach: create the DataFrame directly from the data:

df2 = spark.createDataFrame(data).toDF(*columns)
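Either way, calling show() on the result should print something roughly like this for the data above:

df2.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+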
NNK