Pyspark convert a standard list to data frame

Question

The case is really simple: I need to convert a Python list into a data frame with the following code:

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)

df.show()

It failed with the following error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 1 in type <class 'int'>

Your code failed because the schema doesn't match the data, as per the question linked above. — Alper t. Turker, Jan 25 '18 at 17:45
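To make the mismatch concrete: a StructType describes a whole row, so each record has to be row-like (a tuple, list, or Row) rather than a bare int. Below is a minimal sketch of one way to line the data up with the schema from the question. The helper names to_rows and build_df are hypothetical, and spark is assumed to be an existing SparkSession.

```python
def to_rows(values):
    """Wrap each scalar in a 1-tuple so it lines up with a one-field StructType."""
    return [(v,) for v in values]


def build_df(spark, values):
    """Build a one-column DataFrame; `spark` is assumed to be an existing SparkSession."""
    # Imported here so the pure-Python helper above works without Spark installed.
    from pyspark.sql.types import IntegerType, StructField, StructType

    schema = StructType([StructField("value", IntegerType(), True)])
    return spark.createDataFrame(to_rows(values), schema)


rows = to_rows([1, 2, 3, 4])
print(rows)  # [(1,), (2,), (3,), (4,)]
```

With a live session, build_df(spark, [1, 2, 3, 4]).show() should display the four values in a single value column.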

E. Ducateme · Accepted Answer · 2018-10-25T21:04:39.317

72

This approach uses less code, avoids serialization to an RDD, and is likely easier to understand:

from pyspark.sql.types import IntegerType

# notice the variable name (more below)
mylist = [1, 2, 3, 4]

# notice the parens after the type name
spark.createDataFrame(mylist, IntegerType()).show()

NOTE: About naming your variable list: list is a Python builtin (the list type), so it is strongly recommended to avoid using it, or any other builtin name, as a variable name. Doing so shadows the builtin, and calls such as list() stop working in that scope. When prototyping something fast and dirty, many people use a name like mylist instead.
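A small pure-Python sketch of the shadowing problem described in the note. The function name shadowing_demo is hypothetical, and the shadowing is kept inside a function so that the builtin stays intact everywhere else.

```python
def shadowing_demo():
    """Show what happens when a variable named `list` hides the builtin."""
    list = [1, 2, 3, 4]  # shadows the builtin inside this function only
    try:
        list("abc")  # tries to call the list object, not the builtin type
    except TypeError as exc:
        return str(exc)
    return None


message = shadowing_demo()
print(message)  # 'list' object is not callable
```

Outside the function the builtin is untouched, so list("ab") still returns ['a', 'b'].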

edited Oct 25 '18 at 21:04

answered Jan 25 '18 at 21:21

E. Ducateme


1

This is the better answer because it avoids the serialization to `rdd`. – pault Jan 25 '18 at 21:28
3

Nice answer, clear answer. The last line, the `.show()` makes `df` hold `None`. – Joseph Cottam Aug 09 '18 at 20:51
4

Is there any way to give a name to the data field (which in this case will be 'value' by default)? – Sagar Mahour May 02 '20 at 17:11
Changing the default column name `value` was baffling me as well. My workaround is to just rename it: `tmp_df.selectExpr("value as text")`. Does anyone have a better idea? – Artemis Dec 07 '20 at 15:10
2

Has anyone tried this using millions of rows? I did, and it did not work very well. – FelipePerezR Jan 12 '21 at 20:35
Works OK with millions of rows: `spark.createDataFrame(range(10000000), "INTEGER").show(5)` – 0script0 Jun 24 '22 at 11:53
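On the column-name question raised in the comments, one possible sketch is to build the DataFrame from a type string and then rename the default value column. The function name int_list_to_df and the column name numbers are hypothetical, and spark is assumed to be an existing SparkSession.

```python
def int_list_to_df(spark, values, column_name="numbers"):
    """Create a one-column integer DataFrame and rename the default `value` column.

    `spark` is assumed to be an existing SparkSession.
    """
    # "int" is a type string, used here in place of IntegerType().
    df = spark.createDataFrame(values, "int")
    # A non-struct schema yields a single column called "value"; rename it.
    return df.withColumnRenamed("value", column_name)
```

Usage would look like df = int_list_to_df(spark, [1, 2, 3, 4]) followed by df.show() on its own line. Since show() returns None, keeping it out of the assignment leaves df bound to the DataFrame.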

score 16 · Answer 2 · edited Feb 07 '18 at 05:21


Please see the code below:

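    from pyspark.sql import Row

    li = [1, 2, 3, 4]
    rdd1 = sc.parallelize(li)
    # wrap each value in a Row so every record is row-like
    row_rdd = rdd1.map(lambda x: Row(x))
    # keep the DataFrame in df; show() returns None, so call it separately
    df = sqlContext.createDataFrame(row_rdd, ['numbers'])
    df.show()

Output: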

+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+

edited Feb 07 '18 at 05:21

Pang


answered Jan 25 '18 at 17:25

user15051990


Thanks for the quick answer. This works, but I'd like to understand the difference between your approach and mine. In your code you convert each RDD item into a Row and my code didn't do that; is that why my code failed? – seiya Jan 25 '18 at 17:35
Yup, to read your list into a data frame, you have to convert each item into a Row. From there you can read it directly as a data frame. Please accept the answer if the issue is resolved. – user15051990 Jan 25 '18 at 17:43
You can refer to this link for more details: https://spark.apache.org/docs/1.1.1/api/python/pyspark.sql.Row-class.html – user15051990 Jan 25 '18 at 17:45
There should be a tick button below the upvote and downvote buttons. Just click on it. – user15051990 Jan 25 '18 at 17:50
