
I have the code below. It uses Dask distributed to read 100 JSON files (Workers: 5, Cores: 5, Memory: 50.00 GB):

  from dask.distributed import Client
  import dask.dataframe as dd

  client = Client('xxxxxxxx:8786')
  df = dd.read_json('gs://xxxxxx/2018-04-18/data-*.json')
  df = client.persist(df)

When I run the code, I see only one worker take on the read_json() task, and then I get a memory error followed by a WorkerKilled error.

Should I manually read each file and concat them, or is Dask supposed to do that under the hood?

MT467

1 Answer


You may want to use dask.bag instead of dask.dataframe:

import json
import dask.bag as db
mybag = db.read_text('gs://xxxxxx/2018-04-18/data-*.json').map(json.loads)

After that you can convert the bag into a dask dataframe with

mybag.to_dataframe()

This may require some additional uses of the bag's .map() to get the structure right.
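For example, a minimal sketch of that reshaping step, with hypothetical field names ('id' and 'value') standing in for whatever your records actually contain:

import json
import dask.bag as db

mybag = db.read_text('gs://xxxxxx/2018-04-18/data-*.json').map(json.loads)

# Hypothetical field names: pull a fixed set of keys out of each record so
# every partition produces the same flat columns.
flat = mybag.map(lambda rec: {'id': rec.get('id'), 'value': rec.get('value')})

df = flat.to_dataframe()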

If your data is Hadoop-style JSON (i.e. one object per line), the bag trick will still work, but you may need to operate on individual lines.
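Since read_text already yields one element per line, a sketch for the newline-delimited case (same wildcard path as above) could be as simple as:

import json
import dask.bag as db

# Each element of the bag is one line of text, i.e. one JSON object.
lines = db.read_text('gs://xxxxxx/2018-04-18/data-*.json')
records = lines.map(str.strip).filter(bool).map(json.loads)  # skip blank lines
df = records.to_dataframe()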

mvn
  • dask bag is pretty fast! But when I tried to test how fast it is to store the bag back to GCS with mybag.to_textfiles('gs://xxxxxx/2018-04-18/output/data*.json.gz'), I got a cancelled error right away. Is it a Dask bug? – MT467 Apr 23 '19 at 15:49
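For reference, to_textfiles expects a bag of strings, so a minimal sketch of writing the parsed bag back out (reusing mybag from the answer above) serializes the records first:

import json

# json.dumps turns each record back into a string; the '*' in the path is
# filled in per partition and the .gz suffix selects gzip compression.
mybag.map(json.dumps).to_textfiles('gs://xxxxxx/2018-04-18/output/data*.json.gz')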