
I'm trying to read some output from a SQL Server database into PySpark.

The format of the data is as follows:

{
  "var1": 0,
  "var2": 0,
  "var3": 1,
  "var4": null,
  "var5": 22156,
  "var6": 0,
  "var7": 1,
  "var8": "Denver",
  "var9": "Colorado",
  "var10": null,
  "var11": "2019-02-03",
  "var12": "5E98915C-FFE0-11E8-931E-0242AC11000A",
  "var13": null,
  "var14": null,
  "var15": "USA",
  "var16": "freight",
  "var17": null,
  "var18": null,
  "var19": "denv",
  "var20": "e385dbf6-8b27-4779-a566-f5b75feaf83e",
  "var21": 1547467268304,
  "var22": "3281523"
}
{
  "var1": 0,
  "var2": 0,
  "var3": 1,
  "var4": null,
  "var5": 65618,
  "var6": 0,
  "var7": 1,
  "var8": "Orlando",
  "var9": "Florida",
  "var10": null,
  "var11": "2018-12-14",
  "var12": "578F16D8-FFE0-11E8-931E-0242AC11000A",
  "var13": null,
  "var14": null,
  "var15": "USA",
  "var16": "air",
  "var17": null,
  "var18": null,
  "var19": "orla",
  "var20": "4a231f2c-fe1d-46e8-bc4a-d3333a081566",
  "var21": 1547467268305,
  "var22": "3281523"
}

Note: I have only included a couple of records for the sake of brevity.

I then use the following (simple) code to read this into PySpark:

import time

start_time = time.time()
df = spark.read.option("multiline", "true").json("/Users/davidmoors/Documents/MSSQL_Output_Example")
print("--- JSON file read in: %s seconds ---" % (time.time() - start_time))


df.printSchema()
print(df.count())
df.show(5)

And I get the following output:

--- JSON file read in: 1.6511101722717285 seconds ---
root
 |-- var1: long (nullable = true)
 |-- var10: string (nullable = true)
 |-- var11: string (nullable = true)
 |-- var12: string (nullable = true)
 |-- var13: string (nullable = true)
 |-- var14: string (nullable = true)
 |-- var15: string (nullable = true)
 |-- var16: string (nullable = true)
 |-- var17: string (nullable = true)
 |-- var18: string (nullable = true)
 |-- var19: string (nullable = true)
 |-- var2: long (nullable = true)
 |-- var20: string (nullable = true)
 |-- var21: long (nullable = true)
 |-- var22: string (nullable = true)
 |-- var3: long (nullable = true)
 |-- var4: string (nullable = true)
 |-- var5: long (nullable = true)
 |-- var6: long (nullable = true)
 |-- var7: long (nullable = true)
 |-- var8: string (nullable = true)
 |-- var9: string (nullable = true)

1
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
|var1|var10|     var11|               var12|var13|var14|var15|  var16|var17|var18|var19|var2|               var20|        var21|  var22|var3|var4| var5|var6|var7|  var8|    var9|
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
|   0| null|2019-02-03|5E98915C-FFE0-11E...| null| null|  USA|freight| null| null| denv|   0|e385dbf6-8b27-477...|1547467268304|3281523|   1|null|22156|   0|   1|Denver|Colorado|
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+

For some reason my code only reads in the first record, but not the second. Why is that?

Can someone advise me how I can read all the data into PySpark? I have a very large file with millions of records stored in this format.

Thanks.

dmoors
    The `"multiline"` option requires syntactically valid JSON, and this input is not one - to represent multiple objects you'd need a JSON array. Otherwise you should drop `multiline` and use JSONL (a single document per line). – 10465355 Feb 04 '19 at 17:04
  • @user10465355 many thanks for the suggestions. I'll take a look into these. – dmoors Feb 05 '19 at 09:03
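As the comment suggests, one option is to convert the file to line-delimited JSON (JSONL) before reading it, so that `spark.read.json()` works without the `multiline` option. A minimal sketch of that conversion, assuming the file contains back-to-back pretty-printed objects as shown above (the paths and function name here are placeholders, not from the question):

```python
import json

def concat_json_to_jsonl(src_path, dst_path):
    # Split a file of concatenated, pretty-printed JSON objects into
    # one compact object per line (JSONL). spark.read.json() reads
    # JSONL by default, with no "multiline" option needed.
    decoder = json.JSONDecoder()
    with open(src_path) as src, open(dst_path, "w") as dst:
        text = src.read()
        pos = 0
        while pos < len(text):
            # Skip whitespace between adjacent objects.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                break
            # raw_decode parses one JSON value and returns where it ended.
            obj, pos = decoder.raw_decode(text, pos)
            dst.write(json.dumps(obj) + "\n")
```

For a file with millions of records, a streaming parser would be kinder on memory than `src.read()`, but the idea is the same: each line of the output file is a complete JSON document, which matches what `spark.read.json()` expects by default.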

0 Answers