
I'm playing with some GitHub user data and want to build a graph of all people in the same city. To do this, I need to use the merge operation in dask. Unfortunately, the GitHub user base is 6M users, and it seems the merge operation is causing the resulting dataframe to blow up. I used the following code:

import dask.dataframe as dd
# read the id/city columns twice so the table can be self-joined on city
gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
# pair up users from the same city, then keep only the two id columns
mrg = gh.merge(st, on='city').drop('city', axis=1)
# record the larger and smaller id of each pair
mrg['max'] = mrg.max(axis=1)
mrg['min'] = mrg.min(axis=1)
mrg.to_castra('github')

I can merge on other criteria such as name/username using this code, but I get a MemoryError when I try to run the code above.

I have tried running this with the synchronous, multiprocessing, and threaded schedulers.

I'm trying to do this on a Dell laptop with a 4-core i7 and 8 GB of RAM. Shouldn't dask carry out this operation in a chunked manner, or am I getting this wrong? Is writing the code using pandas dataframe iterators the only way out?
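
For reference, the pandas-iterator fallback I have in mind would look roughly like this (just a sketch; the output file github_pairs.h5 and the key 'pairs' are placeholder names):

import pandas as pd

# sketch: stream one copy of the table in chunks, merge each chunk against the
# full id/city table, and append the result to an output HDF file instead of
# holding the whole merged dataframe in memory
right = pd.read_hdf('data/github.hd5', '/github', columns=['id', 'city']).dropna()
with pd.HDFStore('github_pairs.h5') as out:
    for chunk in pd.read_hdf('data/github.hd5', '/github',
                             columns=['id', 'city'], chunksize=5000):
        merged = chunk.dropna().merge(right, on='city').drop('city', axis=1)
        merged['max'] = merged.max(axis=1)
        merged['min'] = merged.min(axis=1)
        out.append('pairs', merged)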

Prasanjit Prakash
  • Everything here seems fine to me. You might be able to get more information about where the problem is using the [diagnostics](http://dask.readthedocs.io/en/latest/diagnostics.html#example) (a short sketch follows these comments) – MRocklin Aug 24 '16 at 11:25
  • @MRocklin With the `to_*` methods, dask seems to build the entire dataframe in memory instead of writing it out in chunks. This is a problem because the merge step can produce a dataframe that won't fit in memory. Since dask aims to provide out-of-core computation, isn't this a major issue? I can provide an example if you wish. Edit: I have tried this with to_hdf as well – Prasanjit Prakash Aug 26 '16 at 21:00
  • I recommend writing to many hdf files with `to_hdf('filename.*.hdf')` or something similar. As you're doing it, it's still writing out in chunks, but those chunks are backing up behind the writing process. Usually when trying to write to a monolithic file like this I use the single-threaded scheduler. This should be happening by default unless you've overridden it. – MRocklin Aug 26 '16 at 21:23
  • You're right, writing to multiple HDF5 files solved the problem, but it seems `to_castra` doesn't write to multiple files using globbed filenames. Also, I tried writing to a single HDF5 file on a single node with all 3 schedulers (get_sync, threaded, and multiprocessing) and it threw an error in all 3 cases. – Prasanjit Prakash Aug 27 '16 at 03:34
  • Can you raise an issue on the dask issue tracker with an [mcve](http://stackoverflow.com/help/mcve) ? – MRocklin Aug 27 '16 at 11:10
  • Castra is no longer supported. I recommend using HDF – MRocklin Aug 27 '16 at 11:10
  • Sure will do. Thank you so much for the replies. – Prasanjit Prakash Aug 28 '16 at 10:53
  • @PrasanjitPrakash could you submit a working example as an answer, please? I'm facing the same issue and I've tried with `my_df.to_hdf('/tmp/result.*.hdf', key='/data')` but it's running out of memory anyway – Genarito Jan 22 '20 at 15:34
  • @Genarito sorry but this was some time back and I don't think I have the code snippet stored. – Prasanjit Prakash Feb 02 '20 at 02:56
  • No problem! Thank you – Genarito Feb 02 '20 at 17:16
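
Picking up the diagnostics suggestion from the first comment, a minimal sketch might look like this (reusing `mrg` from the question; the output filename pattern and the '/data' key are placeholders, and `visualize()` needs bokeh installed):

from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize

# profile the merge/write pipeline to see which tasks dominate time and memory
with ProgressBar(), Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    mrg.to_hdf('github-*.hdf', '/data')

# renders a bokeh plot of task timings and worker resource usage
visualize([prof, rprof])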

1 Answer


Castra isn't supported anymore, so using HDF is recommended. From the comments, writing out to multiple files with a globbed filename in `to_hdf()` solved the memory error (note that `to_hdf()` also requires a key for the table inside each file):

mrg.to_hdf('github-*.hdf', '/data')

Relevant documentation: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.to_hdf.html
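
Putting the comments above together, a fuller sketch might look like the following (the 'github-*.hdf' filename pattern and the '/data' key are placeholders, and pinning the single-threaded scheduler with `dask.config.set` follows MRocklin's suggestion; `dask.config.set` is the interface in recent dask versions):

import dask
import dask.dataframe as dd

gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                 columns=['id', 'city']).dropna()
st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000,
                 columns=['id', 'city']).dropna()
mrg = gh.merge(st, on='city').drop('city', axis=1)
mrg['max'] = mrg.max(axis=1)
mrg['min'] = mrg.min(axis=1)

# write one HDF file per partition ('*' is replaced with the partition number)
# and force the single-threaded scheduler while writing to keep memory use low
with dask.config.set(scheduler='single-threaded'):
    mrg.to_hdf('github-*.hdf', '/data')

Each partition then lands in its own file (github-0.hdf, github-1.hdf, and so on), so the merged result is never funnelled through a single in-memory writer.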

pavithraes