Parse and write JSON Lines format file

Question

I have a file in JSON Lines format with the following content:

[1, "James", 21, "M", "2016-04-07 10:25:09"]
[2, "Liz", 25, "F", "2017-05-07 20:25:09"]
...

Each line is a JSON array string, and the field types are integer, string, integer, string, string. How can I convert it to a DataFrame with the following schema?

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- time: string (nullable = true)

Conversely, if I have a DataFrame with the above schema, how can I generate a file in the JSON Lines format shown above?

Can you please post the complete JSON file and your expected DataFrame? The data provided is not a proper JSON file, so I need to see the whole file. - Nikhil Suthar, Dec 07 '21 at 06:09

score 0 · Answer 1 · answered Dec 12 '21 at 13:25

Assuming your file has no header line, this is one way to create a DataFrame from it, though I'd expect there to be a better option.

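from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read each line of the file into a single string column named "value"
df = spark.read.text("file_jsonlines")
# Drop the surrounding brackets, then split the remainder on commas
c = F.split(F.regexp_extract('value', r'\[(.*)\]', 1), ',')
# Pick fields by position; string fields keep their quotes and leading space
df = df.select(
    c[0].cast('int').alias('id'),
    c[1].alias('name'),
    c[2].cast('int').alias('age'),
    c[3].alias('gender'),
    c[4].alias('time'),
)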

+---+--------+---+------+----------------------+
|id |name    |age|gender|time                  |
+---+--------+---+------+----------------------+
|1  | "James"|21 | "M"  | "2016-04-07 10:25:09"|
|2  | "Liz"  |25 | "F"  | "2017-05-07 20:25:09"|
+---+--------+---+------+----------------------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- time: string (nullable = true)
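
As the output above shows, splitting on commas leaves the quotes and leading spaces in the string columns, and it would also misparse a value that itself contains a comma. A minimal sketch of an alternative, assuming every line is a valid JSON array with exactly the five fields from the question: decode each line with Python's standard `json` module, and encode rows the same way to go back to JSON Lines. The names `COLUMNS`, `parse_line` and `format_row` are hypothetical helpers, not part of Spark.

```python
import json

# Column order taken from the schema in the question.
COLUMNS = ("id", "name", "age", "gender", "time")


def parse_line(line):
    """Decode one JSON Lines record (a JSON array) into a tuple of Python values."""
    values = json.loads(line)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return tuple(values)


def format_row(row):
    """Encode one row (any sequence, such as a pyspark Row) as a JSON array string."""
    return json.dumps(list(row))


# The two sample lines from the question.
lines = [
    '[1, "James", 21, "M", "2016-04-07 10:25:09"]',
    '[2, "Liz", 25, "F", "2017-05-07 20:25:09"]',
]
rows = [parse_line(line) for line in lines]
out = [format_row(row) for row in rows]
```

In Spark these helpers could probably be applied per line, along the lines of `spark.createDataFrame(spark.sparkContext.textFile(path).map(parse_line), "id int, name string, age int, gender string, time string")` for reading and `df.rdd.map(format_row).saveAsTextFile(out_path)` for writing. Here `path` and `out_path` are placeholders, and this route runs Python code for every row, so it may be slower than built-in column functions.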
