I have a file in JSON Lines format with the following content:

[1, "James", 21, "M", "2016-04-07 10:25:09"]
[2, "Liz", 25, "F", "2017-05-07 20:25:09"]
...

Each line is a JSON array string, and the field types are integer, string, integer, string, string. How can I convert it to a DataFrame with the following schema?

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- time: string (nullable = true)

Conversely, if I have a DataFrame with the above schema, how can I generate a file in the JSON Lines format shown above?

ZygD
will.wang
Can you please provide the complete JSON file and your expected DataFrame? The data provided is not a proper JSON file, so I need to check the complete file. – Nikhil Suthar Dec 07 '21 at 06:09

1 Answer

Assuming your file does not have a header line, this is one way to create a DataFrame from it. I'd expect there to be a better option, though.

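from pyspark.sql import functions as F

# Read each line of the file into a single string column named "value"
df = spark.read.text("file_jsonlines")
# Strip the enclosing brackets, then split on commas.
# Note: string fields keep their quotes and leading space,
# and a comma inside a string value would also be split.
c = F.split(F.regexp_extract('value', r'\[(.*)\]', 1), ',')
df = df.select(
    c[0].cast('int').alias('id'),
    c[1].alias('name'),
    c[2].cast('int').alias('age'),
    c[3].alias('gender'),
    c[4].alias('time'),
)
df.show(truncate=False)
df.printSchema()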
+---+--------+---+------+----------------------+
|id |name    |age|gender|time                  |
+---+--------+---+------+----------------------+
|1  | "James"|21 | "M"  | "2016-04-07 10:25:09"|
|2  | "Liz"  |25 | "F"  | "2017-05-07 20:25:09"|
+---+--------+---+------+----------------------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- time: string (nullable = true)
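
Since every line is itself a valid JSON array, another option is to let a JSON parser handle the quoting instead of splitting on commas. The following is only a minimal sketch in plain Python, not Spark code: `COLUMNS`, `parse_line` and `format_row` are hypothetical helper names, and the sample lines are the two from the question. It assumes the data is small enough to handle on the driver.

```python
import json

# Column names taken from the schema in the question (used only for validation here)
COLUMNS = ["id", "name", "age", "gender", "time"]

def parse_line(line):
    """Parse one JSON Lines record (a JSON array) into a tuple of Python values."""
    values = json.loads(line)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return tuple(values)

def format_row(row):
    """Serialize one row (any sequence of values) back into a JSON array string."""
    return json.dumps(list(row))

lines = [
    '[1, "James", 21, "M", "2016-04-07 10:25:09"]',
    '[2, "Liz", 25, "F", "2017-05-07 20:25:09"]',
]

rows = [parse_line(line) for line in lines]      # JSON Lines -> tuples
round_trip = [format_row(row) for row in rows]   # tuples -> JSON Lines

for line in round_trip:
    print(line)
```

The parsed tuples have no quotes or stray spaces in the string fields, so they could then be handed to `spark.createDataFrame(rows, "id int, name string, age int, gender string, time string")`. For the reverse direction, `format_row` could be applied to each `Row` returned by `df.collect()`, since a `Row` behaves like a tuple. Both of those Spark steps are untested suggestions here, and collecting to the driver will not scale to large files.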
ZygD