
I am trying to convert the pipelined RDD below into a DataFrame.

Pipelined RDD -> user_rdd

['new_user1',
 'new_user2',
 'Onlyknows',
 'Icetea',
 '_coldcoffee_']

I tried to convert it using the code below:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField('Username', StringType(), True)])
user_df = sqlContext.createDataFrame(user_rdd, schema)
user_df.show(20)

I get the following error:

ValueError: Unexpected tuple 'new_user1' with StructType

I tried using toDF() also:

user_df=user_rdd.toDF()

This time the error encountered is:

TypeError: Can not infer schema for type: <type 'str'>

Is there a way to convert this RDD to a DataFrame using PySpark?

Alper t. Turker
Padfoot123

1 Answer


The RDD you have is a list of strings, which is essentially 1D data, whereas a DataFrame requires 2D data (rows, each holding one value per column). Converting each element of the RDD to a one-element tuple should resolve the issue:

user_df = sqlContext.createDataFrame(user_rdd.map(lambda x: (x,)), schema)
#                                             ^^^^^^^^^^^^^^^^^^^  
user_df.show()
+------------+
|    Username|
+------------+
|   new_user1|
|   new_user2|
|   Onlyknows|
|      Icetea|
|_coldcoffee_|
+------------+
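
The trailing comma in `(x,)` is what makes the fix work, and it is easy to drop by accident. The following is a minimal plain-Python sketch (no Spark required) of the difference between a bare string and a one-element tuple; the `to_row` helper and the variable names are hypothetical and only serve the illustration:

```python
# Plain-Python sketch: no Spark needed to see the shape difference.
# `usernames` mirrors the contents of user_rdd from the question.
usernames = ['new_user1', 'new_user2', 'Onlyknows', 'Icetea', '_coldcoffee_']


def to_row(value):
    """Wrap a bare value in a one-element tuple (one row, one column)."""
    return (value,)  # the trailing comma is what makes this a tuple


# Equivalent in spirit to user_rdd.map(lambda x: (x,)) on the RDD.
rows = [to_row(name) for name in usernames]

not_a_tuple = ('new_user1')   # parentheses alone: still a str
one_tuple = ('new_user1',)    # trailing comma: a tuple of length 1

print(type(not_a_tuple).__name__, type(one_tuple).__name__)  # str tuple
print(rows[0])  # ('new_user1',)
```

Writing `lambda x: (x)` instead of `lambda x: (x,)` would leave each element as a plain string, so the same errors as in the question would be expected.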
Psidom