I have the code below, which uses Dask distributed to read 100 JSON files on a cluster with 5 workers, 5 cores, and 50.00 GB of memory:
from dask.distributed import Client
import dask.dataframe as dd

# Connect to the scheduler and lazily define a dataframe over all matching files
client = Client('xxxxxxxx:8786')
df = dd.read_json('gs://xxxxxx/2018-04-18/data-*.json')

# Trigger computation and keep the result in distributed memory
df = client.persist(df)
When I run this, I see only one worker pick up the read_json() task; that worker then runs out of memory and I get a WorkerKilled error.
Should I manually read each file and concatenate them (roughly as in the sketch below), or is Dask supposed to do that under the hood?
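For reference, this is approximately what I mean by the manual approach: build the file list myself, read each file with dask.delayed, and combine the pieces with dd.from_delayed so each file becomes its own partition. The file names are placeholders for my real bucket paths, and I haven't verified that this actually avoids the memory problem:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

# Placeholder list standing in for the 100 data-*.json files in the bucket
files = [f'gs://xxxxxx/2018-04-18/data-{i}.json' for i in range(100)]

@delayed
def load_one(path):
    # Read a single file into a pandas DataFrame
    # (lines=True assumes line-delimited JSON; drop it otherwise)
    return pd.read_json(path, lines=True)

# One delayed read per file, combined into a single Dask dataframe
parts = [load_one(path) for path in files]
df = dd.from_delayed(parts)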