
My Dask script runs well until the last step, which concatenates thousands of dataframes and writes the result to CSV. Memory use immediately jumps from 6 GB to over 15 GB and I get an error like "95% memory exceeded, restarting workers", even though my machine has plenty of memory. I have two questions: (1) How can I increase the memory available to the workers, or to this last step? (2) Would intermediate concat steps help, and how best to add them? The problematic code is below:

future = client.submit(pd.concat, tasks)
future.result().to_csv(path)
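To sketch what intermediate concats could look like: instead of one giant `pd.concat` over thousands of results, reduce them in rounds so no single task ever materializes more than a handful of intermediate frames at once. This is a sketch, not a tested fix; the `tree_concat` helper, the `width` parameter, and the `memory_limit="8GB"` value are my own illustrative choices (Dask's `Client`/`LocalCluster` do accept a per-worker `memory_limit` keyword).

```python
import pandas as pd

# Sketch, under assumptions: `submit` is client.submit on a real cluster
# (or a plain call for local testing); `width` is the fan-in per round.
def tree_concat(submit, parts, width=8):
    """Reduce many dataframe parts/futures in rounds instead of one giant concat.

    Each round concatenates groups of `width` parts, so no single task
    holds more than `width` intermediate results at once.
    """
    while len(parts) > 1:
        groups = [parts[i:i + width] for i in range(0, len(parts), width)]
        parts = [submit(pd.concat, g) for g in groups]
    return parts[0]

# On a cluster (hypothetical values):
#   client = Client(n_workers=4, memory_limit="8GB")  # memory_limit is per worker
#   final = tree_concat(client.submit, tasks).result()
#   final.to_csv(path)
```

Passing `client.submit` lets the scheduler resolve the futures inside each group; for a quick local check you can pass a plain call (`lambda f, g: f(g)`) and real dataframes.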
D. Vyd
  • Have you looked at https://stackoverflow.com/questions/54459056/dask-memory-error-when-running-df-to-csv – quasiben Jan 07 '20 at 15:24
  • The issue is similar. I also received the error despite having more memory available. Using blocksize in read_csv is problematic for me because my data are grouped and loading a partial group will ruin the process. Concatting results is a very common map-reduce step. Are other people using many intermediate concats? Would that decrease memory compared to a single giant concat? – D. Vyd Jan 08 '20 at 11:21
  • Agreed, concatting is very common. Looking at the code again, it seems odd that you are not using a high-level collection like the Dask dataframe for dataframe-like operations. Any reason for that? If you were, you could write each partition to a CSV. And if you still wanted a single CSV, that can be done as well: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv – quasiben Jan 08 '20 at 14:23
  • @quasiben, although I'm using a dataframe now, we conduct a lot of simulations involving arbitrary code and data. I'm trying to understand how to submit tasks to the dask client and act on the results (not always concat). I noticed in the link you shared that the user encountered the same memory problem using a high-order collection. I'll switch to that approach though if you think it is a better (or my only) solution. – D. Vyd Jan 09 '20 at 11:36
  • Dask's "as_completed" might provide an alternative to concatting all task results in a single final step. For now, I'm writing each task result to a CSV and concatting via Windows copy command. Only the first CSV has headers. – D. Vyd Jan 12 '20 at 12:50
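The `as_completed` workaround mentioned in the last comment can also be done in one process, appending each finished result to a single CSV and writing the header only once. A minimal sketch, assuming the `tasks` futures and `path` from the question; the `stream_to_csv` helper is illustrative, not a library API:

```python
import pandas as pd

# Sketch: append each dataframe to one CSV file, header written only once,
# so the full concatenated result never has to fit in memory at one time.
def stream_to_csv(frames, path):
    first = True
    with open(path, "w", newline="") as f:
        for df in frames:
            df.to_csv(f, header=first, index=False)
            first = False

# On a cluster (hypothetical), feed it results as they finish:
#   from distributed import as_completed
#   stream_to_csv((fut.result() for fut in as_completed(tasks)), path)
```

Note that `as_completed` yields results in completion order, which is not deterministic; if row order across tasks matters, sort the output afterwards or iterate the futures in submission order instead.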

0 Answers