
I'm trying to read a bunch of JSON files stored on S3, but Dask raises `IndexError: list index out of range` when I compute the DataFrame.

My call to open the JSON files looks like this:

pets_data = dd.read_json("s3://my-bucket/pets/*.json", meta=meta, blocksize=None, orient="records", lines=False)
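
Here, `meta` is an empty pandas DataFrame that tells Dask the expected column names and dtypes up front, so it doesn't have to sample the files. The real schema isn't shown in the post, so the following is a purely hypothetical sketch:

import pandas as pd

# Hypothetical schema: the actual columns from the question are not shown.
# Dask only uses the structure (column names and dtypes) of this empty frame.
meta = pd.DataFrame({
    "name": pd.Series(dtype="object"),
    "age": pd.Series(dtype="int64"),
})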

It fails when I call `to_csv` (writing to S3 and writing locally both fail):

# saving locally fails
pets_data.to_csv(
    "pets-full-data.csv",
    single_file=True,
    index=False,
)

# saving to S3 fails as well
pets_data.to_csv(
    "s3://my-bucket/pets-full-data.csv",
    single_file=True,
    index=False,
)

Stack trace:

File "main.py", line 89, in <module>
pets_data.to_csv(
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 1423, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 808, in to_csv
value = to_csv_chunk(dfs[0], first_file, **kwargs)
IndexError: list index out of range

NOTE: This only occurs when I open the files from S3; when I open the same files from local storage, everything works fine.

  • Please check what files you see on S3: `import s3fs; s3 = s3fs.S3FileSystem(); s3.glob("s3://my-bucket/pets/*.json")` – mdurant Dec 20 '20 at 18:01
  • Your answer gave me a key point: it fails when I try to list the files. The problem was that I forgot to add the ListBucket permission. I'm still wondering, though, why the library (apparently) fails at compute time instead of raising an error about S3 permissions. – Carlos Rojas Dec 22 '20 at 23:43
  • Since you provide `meta`, dask doesn't need to read any data until compute time, but it probably should have noticed that there were zero files beforehand. – mdurant Dec 23 '20 at 13:59
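
Putting the comments together: without the ListBucket permission the glob expanded to zero files, Dask built a DataFrame with zero partitions, and the empty partition list only blew up later inside to_csv (the `dfs[0]` in the trace). A defensive version of the original call that fails fast instead, as a minimal sketch assuming s3fs is installed and picks up the same credentials as Dask:

import dask.dataframe as dd
import s3fs

fs = s3fs.S3FileSystem()
# Expand the wildcard ourselves first: a permission problem surfaces here,
# either as an error or as an empty result, instead of deep inside to_csv.
files = fs.glob("s3://my-bucket/pets/*.json")
if not files:
    raise FileNotFoundError(
        "No files matched s3://my-bucket/pets/*.json; "
        "check the s3:ListBucket permission on the bucket"
    )

pets_data = dd.read_json(
    "s3://my-bucket/pets/*.json",
    meta=meta,  # the same (unshown) meta as in the question
    blocksize=None,
    orient="records",
    lines=False,
)

For reference, a minimal IAM policy that allows both the listing and the reads might look like this (bucket name taken from the question; writing the CSV back to S3 would additionally need s3:PutObject):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}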

0 Answers