I have 100 large JSON files in GCS and want to load them into a pandas DataFrame. I've used something like the following in dask:

 dd.read_json('gs://dask_poc/2018-04-18/data-*.json')

But when I used:

 pd.read_json('gs://dask_poc/2018-04-18/data-*.json')

I got the following error: ValueError: Expected object or value

Can pandas not aggregate all the files together the way dask does?

MT467
  • This may sound like a silly question and I probably already know the answer, but where are you running this code? – cs95 Mar 14 '19 at 21:01
  • @coldspeed locally in my jupyterlab – MT467 Mar 14 '19 at 21:01
  • You could probably use a for loop to open each file in that folder and execute whatever code you have for each JSON file – dataviews Mar 14 '19 at 21:02
  • Unfortunately, pandas does not have native GCP support, nor can it be expected to magically understand GCP links. – cs95 Mar 14 '19 at 21:02
  • Does the answer to this question help? https://stackoverflow.com/questions/46885631/loading-multiple-files-from-google-cloud-storage-into-a-single-pandas-dataframe – dim_user Mar 14 '19 at 21:04
  • @coldspeed I thought the same but today when I check this link, they put gcs as a path too! https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html – MT467 Mar 14 '19 at 21:04
  • Well, I'm floored. How is that possible without any authentication on your part? – cs95 Mar 14 '19 at 21:05
  • @saul cruz yeah, I saw similar answers, but my files are huge, so neither concat nor a for loop looks promising. – MT467 Mar 14 '19 at 21:07
  • @coldspeed lol, not sure – MT467 Mar 14 '19 at 21:08
  • Learned about GCS support in pandas today, thanks guys! See [`pandas/io/gcs.py`](https://github.com/pandas-dev/pandas/blob/v0.24.2/pandas/io/gcs.py) and the [`gcsfs` documentation](https://gcsfs.readthedocs.io/en/latest/) for some details. Not sure what the described error is. – FabienP Mar 14 '19 at 21:28
  • [This answer](https://stackoverflow.com/a/52106361/6914989) could help. Looks like glob aggregation has to be done manually for pandas. – FabienP Mar 14 '19 at 21:33
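
As the last comment suggests, pandas appears to treat the string passed to pd.read_json as a single path rather than a glob pattern, so the wildcard has to be expanded by hand. Below is a minimal sketch of that idea, assuming gcsfs is installed and credentials are already configured. The helper name read_json_glob and its fs parameter are hypothetical and are not part of pandas or gcsfs; fs only needs glob() and open() methods, which gcsfs.GCSFileSystem provides.

```python
import pandas as pd


def read_json_glob(pattern, fs=None, **read_json_kwargs):
    """Read every JSON file matching `pattern` into a single DataFrame.

    Hypothetical helper: `fs` is any object exposing glob(pattern) and
    open(path, mode), such as a gcsfs.GCSFileSystem instance.
    """
    if fs is None:
        # Assumption: gcsfs is installed and default credentials work.
        import gcsfs

        fs = gcsfs.GCSFileSystem()

    # Expand the wildcard ourselves; sort for a deterministic row order.
    paths = sorted(fs.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")

    frames = []
    for path in paths:
        # Open each object in text mode and let pandas parse it.
        with fs.open(path, "r") as handle:
            frames.append(pd.read_json(handle, **read_json_kwargs))

    # Stack the per-file frames and renumber the index.
    return pd.concat(frames, ignore_index=True)
```

With the bucket from the question, this could be called as read_json_glob('dask_poc/2018-04-18/data-*.json'). Unlike dask, which builds the frame lazily, this reads every file eagerly, so the combined data has to fit in memory; for files that are too large for that, staying with dd.read_json is probably the safer choice.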

0 Answers