4

Hi, I have a Python script that uses the Dask library to handle a very large data frame, larger than physical memory. I notice that the job gets killed in the middle of a run if the machine's memory usage stays at 100% for some time.

Is this expected? I would have thought the data would be spilled to disk, and there is plenty of disk space left.

Is there a way to limit its total memory usage? Thanks.

EDIT:

I also tried:

dask.set_options(available_memory=12e9)

It did not work. It did not seem to limit the memory usage. Again, when memory usage reaches 100%, the job gets killed.

Bo Qiang
  • How large is your data? Do you get an error message? Dask should only use memory when you call `dd.compute()`, where `dd` is your dask dataframe. – jpp Jan 24 '18 at 14:29
  • Can you post some code to see how you are calling dask methods? Also, did you check if dask processes are consuming 100% memory? – Anil_M Jan 24 '18 at 14:32
  • The CSV file is around 90 GB without compression and my physical memory is 16 GB. The most expensive part is a global sorting through set_index(). Basically, the code goes like this: ddf = dd.read_csv("*.csv"), ddf = ddf.set_index("sort_col").compute(). No error messages except the system tells me the job got killed. I am running it in a EC2 instance. – Bo Qiang Jan 24 '18 at 14:44
  • The code you are calling should be part of the question, not just in the comments. – mdurant Jan 24 '18 at 16:09

2 Answers

3

The line

 ddf = ddf.set_index("sort_col").compute()

is actually pulling the whole dataframe into memory and converting it to pandas. You want to remove the .compute(), and apply whatever logic you need (filtering, groupby/aggregations, etc.) first, before calling compute to produce a result that is small enough.

The important thing to remember is that the resulting output must be able to fit into memory, and each chunk being processed by each worker (plus overheads) also needs to fit into memory.
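For instance, a rough sketch of keeping everything lazy and writing the sorted result straight to disk (assuming a Parquet engine such as pyarrow is installed; the output path is a placeholder):

    import dask.dataframe as dd

    # Read lazily; nothing is loaded into memory yet
    ddf = dd.read_csv("*.csv")

    # set_index builds a (still lazy) plan for the global sort/shuffle
    ddf = ddf.set_index("sort_col")

    # Write partition by partition to disk instead of calling .compute(),
    # so the full sorted frame never has to exist in memory at once
    ddf.to_parquet("sorted_output/")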

mdurant
0

Try going through the data in chunks with:

    import pandas as pd

    chunksize = 10 ** 6
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        process(chunk)
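As a hypothetical illustration of what process() could do, chunkwise work is typically accumulation; the file name "large.csv" and the column "value" below are placeholders, not from the answer:

    import pandas as pd

    filename = "large.csv"  # placeholder path
    chunksize = 10 ** 6
    partial_sums = []

    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # Each chunk is an ordinary pandas DataFrame of at most `chunksize` rows
        partial_sums.append(chunk["value"].sum())  # "value" is a placeholder column

    total = sum(partial_sums)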

rodcoelho
  • Thanks for your reply. In the code, I need to do a global sort through set_index(). Can I still process it in chunks? – Bo Qiang Jan 24 '18 at 14:50
  • Unfortunately no. Chunkwise processing this way is great for accumulation/aggregation, but not for anything that cannot be done independently per-chunk. – mdurant Jan 24 '18 at 16:12