
I would like to convert two lists into a PySpark DataFrame, with each list as its own column.

I tried

a=[1, 2, 3, 4]
b=[2, 3, 4, 5]
sqlContext.createDataFrame([a, b], schema=['a', 'b']).show()

But I got

+---+---+---+---+                                                               
|  a|  b| _3| _4|
+---+---+---+---+
|  1|  2|  3|  4|
|  2|  3|  4|  5|
+---+---+---+---+

What I really want is this:

+---+---+                                                              
|  a|  b|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
|  4|  5|
+---+---+

Is there a convenient way to create this result?

statsNoob
  • check this out: https://stackoverflow.com/questions/48448473/pyspark-convert-a-standard-list-to-data-frame?noredirect=1&lq=1 – Itachi Oct 12 '18 at 17:14

3 Answers


Just transpose the lists:

sqlContext.createDataFrame(zip(a, b), schema=['a', 'b']).show()
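In Python 3, zip returns an iterator rather than a list; Spark generally accepts that, but if your version complains, materialising the pairs first is a safe variant (a minimal sketch of the same idea):

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
# list(...) turns the zip iterator into a concrete list of (a, b) pairs
sqlContext.createDataFrame(list(zip(a, b)), schema=['a', 'b']).show()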
fafl

I don't know about pyspark directly, but I would guess instead of this data structure:

[[1, 2, 3, 4],
 [2, 3, 4, 5]]

you need to give it this:

[[1, 2],
 [2, 3],
 [3, 4],
 [4, 5]]

One way to go from your data structure to the required one is to use numpy to transpose:

import numpy as np
a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
sqlContext.createDataFrame(np.array([a, b]).T, schema=['a', 'b']).show()
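If your Spark version cannot infer a schema from numpy types (whether it can is version-dependent, so treat this as an assumption to verify), converting the transposed array back to plain Python lists is a simple workaround:

import numpy as np

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
# .tolist() converts the numpy scalars back into plain Python ints
sqlContext.createDataFrame(np.array([a, b]).T.tolist(), schema=['a', 'b']).show()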
Dan
  • You'll probably want to be doing this if your data is going to be of any substantial size. And you'd likely want to go further and not use lists in the first place, instead having your data live in numpy arrays from the start. – PMende Oct 12 '18 at 17:21
  • 1
    to add to that, if your data aren't of substantial size, then why use spark at all? – Dan Oct 12 '18 at 17:23

Below are the steps to create a PySpark DataFrame.

Create a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

Create the data and column names:

columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

First approach: create the DataFrame from an RDD:

rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
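As a side note, toDF(*columns) renames the columns after the DataFrame is built; the names could also be passed up front through the schema argument (a minimal variant using the same rdd):

df = spark.createDataFrame(rdd, schema=columns)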

Second approach: create the DataFrame directly from the data:

df2 = spark.createDataFrame(data).toDF(*columns)
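Either way, calling show() on the result should print something roughly like this for the data above:

df2.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+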
NNK