Convert RDD to Dataframe in Spark Streaming Python

Question

I am trying to convert an RDD to a DataFrame in Spark Streaming, using the following process:

from pyspark.sql.types import StringType, StructField, StructType

socket_stream = ssc.socketTextStream("localhost", 9999)

def convert_to_df(rdd):
    schema = StructType([StructField("text", StringType(), True)])
    df = spark.createDataFrame(rdd, schema=schema)
    df.show(10)

socket_stream.foreachRDD(convert_to_df)

I provide input through a socket opened with nc -lk 9999.

If I enter "hello world" as input, I get the following error:

StructType can not accept object 'hello world' in type <class 'str'>

Expected output:

+-----------+
|text       |
+-----------+
|hello world|
+-----------+

10465355 · Accepted Answer · 2018-12-13T14:18:59.473

Since you use RDD[str], you should provide a matching type. For an atomic value, that is either the corresponding AtomicType:

from pyspark.sql.types import StringType, StructField, StructType

rdd = sc.parallelize(["hello world"])
spark.createDataFrame(rdd, StringType())

or its string description:

spark.createDataFrame(rdd, "string")

If you want to use StructType, convert the data to tuples first:

schema = StructType([StructField("text", StringType(), True)])

spark.createDataFrame(rdd.map(lambda x: (x, )), schema)
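Applied to the handler from the question, the tuple-wrapping fix could look like the sketch below. The names to_row and make_handler are hypothetical helpers introduced for illustration, and the schema is passed as the datatype string "text: string" instead of a StructType, which should describe the same single string column. The empty-batch guard is an optional addition, not part of the original code.

```python
def to_row(line):
    """Wrap a bare string in a 1-tuple so it matches a one-field struct schema."""
    return (line,)


def make_handler(spark):
    """Build a foreachRDD callback bound to an existing SparkSession (hypothetical helper)."""
    def convert_to_df(rdd):
        # Skip empty batches so show() does not print an empty table every interval.
        if rdd.isEmpty():
            return
        # Each element is now a tuple, so a struct schema can accept it.
        df = spark.createDataFrame(rdd.map(to_row), "text: string")
        df.show(10)
    return convert_to_df


# Usage, assuming `spark` and `socket_stream` exist as in the question:
# socket_stream.foreachRDD(make_handler(spark))
```

Passing the session in through make_handler keeps the callback free of globals, which makes it easier to exercise outside a running stream.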

Of course, if you are just going to convert each batch to a DataFrame, it makes much more sense to use Structured Streaming all the way:

lines = (spark
    .readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())
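The snippet above only defines the source. A minimal sketch of a complete query, assuming the console sink is good enough for inspecting the data, might look like this. start_console_query is a hypothetical name; the socket source exposes each line in a string column named value, which is renamed to text here to match the expected output in the question.

```python
def start_console_query(spark, host="localhost", port=9999):
    """Read lines from a socket and print each micro-batch to the console."""
    lines = (spark
        .readStream
        .format("socket")
        .option("host", host)
        .option("port", port)
        .load())
    # The socket source yields one string column called "value".
    text = lines.withColumnRenamed("value", "text")
    # Start the streaming query; the caller decides how long to keep it running.
    return (text
        .writeStream
        .format("console")
        .outputMode("append")
        .start())


# Usage, assuming an existing SparkSession named `spark`:
# query = start_console_query(spark)
# query.awaitTermination()
```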

score -1 · Answer 2 · edited Dec 13 '18 at 13:18


Try ArrayType(StringType())

Otherwise, since you have only one column, try giving the schema directly:

df = spark.createDataFrame(rdd, StringType())

Also check out UDFs in PySpark, as you need to declare a UDF for Spark.

edited Dec 13 '18 at 13:18 by TheLethalCoder · answered Dec 13 '18 at 13:15 by Shark
