
I have a dataframe with 320 rows. I converted it to ndjson with pandas:

df.to_json('file.json', orient='records', lines=True)

However, when I load the data back, I only get 200 rows.

with open('file.json') as f:
    print(len(f.readlines()))

gives 200

spark.read.json('file.json').count()

also gives 200

Only reloading it with pandas gives the correct row count:

pd.read_json('file.json', orient='records', lines=True)

My dataset contains \n characters in the fields. If anything, I would therefore expect as many lines as records, or more, when I read the file back with plain Python or Spark, not fewer.
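
For reference, here is the behaviour I expected, as a minimal sketch (the toy dataframe is made up): on a correct write, embedded newlines should be escaped as \n inside the JSON strings, so every record still occupies exactly one physical line.

import pandas as pd

# toy example: one field value contains an embedded newline
df = pd.DataFrame({'text': ['line1\nline2', 'plain']})
df.to_json('file.json', orient='records', lines=True)

# the newline should be written as the two-character escape \n inside
# the JSON string, so the file should contain exactly 2 physical lines
with open('file.json') as f:
    print(len(f.readlines()))  # expected: 2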

What is the issue here with the pandas.to_json method?


1 Answer


I manually inspected the JSON file line by line and discovered that pandas.to_json seems to write some consecutive records onto the same physical line, joined by '},{' instead of being separated by a newline (or I misunderstood the specification).

with open('file.json') as f:
    # restore the missing newlines: records that were written onto the
    # same line are joined by '},{' and should be split back apart
    j = f.read().replace('},{', '}\n{')
with open('file.jsonl', 'w') as f:
    f.write(j)

Rewriting the file with the missing newlines restored fixes the issue.
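
If the plain string replacement feels fragile (it would also rewrite any field value that happens to contain '},{'), an alternative sketch is to let pandas recover the records, since pd.read_json still parses the file correctly, and then rewrite them one per line with the standard json module:

import json
import pandas as pd

# pandas still recovers all records from the malformed file
df = pd.read_json('file.json', orient='records', lines=True)

# round-trip through a JSON array to get plain Python records, then
# write one object per physical line; json.dumps escapes embedded
# newlines inside string values
records = json.loads(df.to_json(orient='records'))
with open('file.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

Writing the lines with json.dumps avoids calling to_json(lines=True) again, which could reproduce the original problem.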
