
My DataFrame structure looks like this:

+--------------------+
|              values|
+--------------------+
|{"user_id":"00000...|
+--------------------+

And the structure of the string in the `values` column looks like this:

{
    "user_id":"00000000002",
    "client_args":{
        "order_by":"id",
        "page":"4",
        "keyword":"Blue flowers",
        "reverse":"false"
    },
    "keyword_tokenizer":[
        "Blue",
        "flowers"
    ],
    "items":[
        "00000065678",
        "00000065707",
        "00000065713",
        "00000065741",
        "00000065753",
        "00000065816",
        "00000065875",
        "00000066172"
    ]
}

I want the DataFrame to look like this:

+---------------+-------------------+------------------+----------------------------+
|    user_id    |     client_args   | keyword_tokenizer|            items          |
+---------------+-------------------+------------------+----------------------------+
|00000000000001 |{"order_by":"",...}|["Blue","flowers"]|["000006578","00002458",...]|
+---------------+-------------------+------------------+----------------------------+

My code looks like this

import json
import pyspark.sql.functions as f

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

# note: the element type is required in the DDL string
# (`array<string>`, not bare `array`)
@f.udf("struct<user_id:string,client_args:string,keyword_tokenizer:array<string>,items:array<string>>")
def str_to_json(s):
    return json.loads(s)

lines.select(str_to_json(lines.values))

But this only parses the string into a single struct column; it doesn't split it into separate columns. What should I do?

Update: I later found the following method to solve this problem. Is it inefficient?

schema = StructType([StructField("user_id",StringType()),
                     StructField("client_args", StructType([
                         StructField("order_by", StringType()),
                         StructField("page", StringType()),
                         StructField("keyword", StringType()),
                         StructField("reverse", StringType()),
                     ])),
                     StructField("keyword_tokenizer", ArrayType(StringType())),
                     StructField("items", ArrayType(StringType()))])

new_df = lines.withColumn("tmp", f.from_json(lines.values, schema))\
    .withColumn("user_id", f.col("tmp").getItem("user_id"))\
    .withColumn("client_args", f.col("tmp").getItem("client_args"))\
    .withColumn("keyword_tokenizer", f.col("tmp").getItem("keyword_tokenizer"))\
    .withColumn("items", f.col("tmp").getItem("items"))\
    .drop("values", "tmp")
NickWick
  • Does this answer your question? [Pyspark: Parse a column of json strings](https://stackoverflow.com/questions/41107835/pyspark-parse-a-column-of-json-strings) – Alex Ott Apr 21 '20 at 07:15
  • I appreciate this, but my code runs under Structured Streaming. When I used the method in the link you provided, I got this error: `Queries with streaming sources must be executed with writeStream.start();;\ntextSocket`. Can you help me? – NickWick Apr 21 '20 at 08:25
  • It's not a related error... basically you need to do the following: 1) create a schema for the data (for example, you can read a JSON file and let Spark infer the schema); 2) use `from_json` with the created schema to convert the strings into columns. Here is an example in Scala that does this; I don't have an example in Python: https://github.com/alexott/dse-playground/blob/master/spark-dse/src/main/scala/com/datastax/alexott/streaming/StructuredStreamingKafkaDSE.scala#L29 – Alex Ott Apr 21 '20 at 10:15
  • Thanks, I looked at your code and ported it to Python to check whether it was inefficient, and I added the code to the question. – NickWick Apr 21 '20 at 10:30
  • 1
    after decoding JSON, you can do `.select("*", "tmp.*").drop("tmp)`, so you don't need to extract nested fields one by one... – Alex Ott Apr 21 '20 at 12:56
  • All right. Thank you very much. – NickWick Apr 21 '20 at 13:57

1 Answer


Read it as a JSON file using PySpark:

df = spark.read.json("test.json") 
df.show()

+--------------------+--------------------+-----------------+-----------+
|         client_args|               items|keyword_tokenizer|    user_id|
+--------------------+--------------------+-----------------+-----------+
|[Blue flowers, id...|[00000065678, 000...|  [Blue, flowers]|00000000002|
+--------------------+--------------------+-----------------+-----------+
Jay Kakadiya