The case is really simple: I need to convert a Python list into a DataFrame with the following code:

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)

df.show()

It failed with the following error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 1 in type <class 'int'>
seiya

2 Answers

This approach uses less code, avoids going through an RDD, and is likely easier to understand:

from pyspark.sql.types import IntegerType

# notice the variable name (more below)
mylist = [1, 2, 3, 4]

# notice the parens after the type name
spark.createDataFrame(mylist, IntegerType()).show()

NOTE on naming a variable list: list is a Python built-in (the list type, which is also called as list() to construct lists), so it is strongly recommended not to use it, or any other built-in name, as a variable name. Doing so shadows the built-in, and things like list() stop working in that scope. When prototyping something quick and dirty, many people use a name like mylist instead.
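
As a small illustration of the naming note above, here is a minimal sketch in plain Python (no Spark needed). The function names shadowing_demo and safe_demo are made up for this example; they only show what happens when the name list is rebound, versus using a name such as mylist.

```python
def shadowing_demo():
    # Rebinding the name `list` hides the built-in inside this function.
    list = [1, 2, 3, 4]
    try:
        list("abc")  # with the built-in, this would build ['a', 'b', 'c']
    except TypeError as exc:
        # The name now refers to a list object, which is not callable.
        return str(exc)
    return "no error"


def safe_demo():
    # A different variable name leaves the built-in list() usable.
    mylist = [1, 2, 3, 4]
    return list(range(len(mylist)))


print(shadowing_demo())
print(safe_demo())
```

The first call reports that a list object is not callable, while the second still uses list() normally because the built-in was never rebound.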

E. Ducateme

Please see the code below:

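    from pyspark.sql import Row

    li = [1, 2, 3, 4]
    rdd1 = sc.parallelize(li)
    # wrap each value in a Row so it matches a one-column schema
    row_rdd = rdd1.map(lambda x: Row(x))
    # build the DataFrame first; show() returns None, so don't assign its result
    df = sqlContext.createDataFrame(row_rdd, ['numbers'])
    df.show()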

+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
Pang
user15051990
  • Thanks for the quick answer. This works, but I'd like to understand the difference between your approach and mine. In your code you convert each RDD item into a Row, and my code didn't do that. Is that why my code failed? – seiya Jan 25 '18 at 17:35
  • Yup, to read your list into a data frame, you have to convert each item into a Row. From there you can read it directly as a data frame. Please accept the answer if the issue is resolved. – user15051990 Jan 25 '18 at 17:43
  • You can refer to this link for more details: https://spark.apache.org/docs/1.1.1/api/python/pyspark.sql.Row-class.html – user15051990 Jan 25 '18 at 17:45
  • There should be a check mark button below the upvote and downvote buttons. Just click on it. – user15051990 Jan 25 '18 at 17:50
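
To make the point from the comments concrete: a StructType schema describes a whole row, so each element handed to createDataFrame has to be row-like (a Row, tuple, or list), not a bare int. The sketch below is only an illustration under that assumption. The helpers to_rows and build_dataframe are hypothetical names, and build_dataframe assumes pyspark is installed and that the caller already has a SparkSession to pass in.

```python
def to_rows(values):
    """Wrap each scalar in a one-element tuple so it looks like a one-column row."""
    return [(v,) for v in values]


def build_dataframe(spark, values):
    """Hypothetical helper: build a one-column DataFrame from plain Python ints.

    pyspark is imported lazily so that to_rows() can be tried without Spark.
    """
    from pyspark.sql.types import IntegerType, StructField, StructType

    # Same one-column schema as in the question.
    schema = StructType([StructField("value", IntegerType(), True)])
    # Each element is now a 1-tuple, which a StructType schema can accept.
    return spark.createDataFrame(to_rows(values), schema)


rows = to_rows([1, 2, 3, 4])
print(rows)
```

With an existing SparkSession named spark, calling build_dataframe(spark, [1, 2, 3, 4]).show() should display the four values in a single value column.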