0

I am trying to convert an RDD to a DataFrame in Spark Streaming. I am following the process below.

socket_stream = ssc.socketTextStream("localhost", 9999)
def convert_to_df(rdd):
    schema = StructType([StructField("text", StringType(), True)])
    df = spark.createDataFrame(rdd, schema=schema)
    df.show(10)

socket_stream.foreachRDD(convert_to_df)

I am providing input through a socket opened with nc -lk 9999.

If I give "hello world" as my input, it shows the error below:

StructType can not accept object 'hello world' in type <class 'str'>

Expected output:

+-----------+
|text       |
+-----------+
|hello world|
+-----------+

2 Answers

1

Since you use RDD[str], you should provide a matching type. For an atomic value, that is either the corresponding AtomicType:

from pyspark.sql.types import StringType, StructField, StructType

rdd = sc.parallelize(["hello world"])
spark.createDataFrame(rdd, StringType())

or its string description:

spark.createDataFrame(rdd, "string")

If you want to use StructType, convert the data to tuples first:

schema = StructType([StructField("text", StringType(), True)])

spark.createDataFrame(rdd.map(lambda x: (x, )), schema)
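Applied to the streaming callback from the question, the tuple conversion might look like the sketch below. It assumes each batch RDD holds plain strings and that a SparkSession is passed in as `spark`. The helper name `to_row` and the empty-batch check are additions for illustration, not part of the original code.

```python
def to_row(line):
    """Wrap a plain string in a 1-tuple so it matches a one-field StructType."""
    return (line,)


def convert_to_df(rdd, spark):
    """Convert one micro-batch of strings to a DataFrame and print it."""
    # Imported here so to_row can be used without a Spark installation.
    from pyspark.sql.types import StringType, StructField, StructType

    if rdd.isEmpty():  # assumption: batches with no input are simply skipped
        return
    schema = StructType([StructField("text", StringType(), True)])
    df = spark.createDataFrame(rdd.map(to_row), schema=schema)
    df.show(10)


# Usage with the stream from the question
# (assumes `socket_stream` and `spark` already exist):
# socket_stream.foreachRDD(lambda rdd: convert_to_df(rdd, spark))
```

Each element reaches createDataFrame as a one-element tuple, which is what a StructType with a single field expects.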

Of course, if you are just going to convert each batch to a DataFrame, it makes much more sense to use Structured Streaming all the way:

lines = (spark
    .readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())
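The reader above only defines the source, so nothing runs until a sink is started. A minimal sketch of a complete query follows, assuming `spark` is an existing SparkSession; the function name `start_console_query` is hypothetical. The socket source exposes a single string column named `value`, which is renamed here to match the `text` column from the question.

```python
def start_console_query(spark, host="localhost", port=9999):
    """Read lines from a socket and print each micro-batch to the console."""
    lines = (spark
        .readStream
        .format("socket")
        .option("host", host)
        .option("port", port)
        .load())

    # The socket source yields one string column called "value".
    text = lines.withColumnRenamed("value", "text")

    # The console sink is meant for debugging; append mode emits only new rows.
    return (text
        .writeStream
        .outputMode("append")
        .format("console")
        .start())


# Usage (assumes `spark` exists and `nc -lk 9999` is running):
# query = start_console_query(spark)
# query.awaitTermination()
```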
-1

Try ArrayType(StringType())

Otherwise, since you have only one column, try giving the schema directly:

df = spark.createDataFrame(rdd, StringType())

Also check out UDFs for PySpark, as you need to declare a UDF for Spark.
