
I'm new to Dask and I'm finding it quite useful, but I have a problem that I haven't been able to solve yet.

I have a data set that is larger than memory, and I want to drop the rows that have duplicate values in one of its columns.

The problem is that even after this removal the data set will still be larger than memory, so the result needs to be computed out of core and written directly to disk.

Of course, I could write code to do this removal manually, but I was wondering whether Dask already has this implemented.

This is my code:

from dask.distributed import Client
import dask.dataframe as dd

client = Client(memory_limit='8GB') # I've tried without this limit
data = dd.read_csv("path_to_file", dtype={
    'id': 'Int64'
}, sample=1000)
data.drop_duplicates(subset=['text'])
results = data.compute() #   <- Here is the problem
results.to_csv("pathout", index=False)

When I call compute, the result is a pandas DataFrame, which in this case is larger than memory. I'm receiving a lot of:

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker

and then the execution fails with "KilledWorker".

EDIT:

Self-contained example:

import numpy as np
import pandas as pd
from dask.distributed import Client
import dask.dataframe as dd

# Creates about 2 GB of data
data = np.random.randint(0, 10000, size=(2000000,200))
pd.DataFrame(data).to_csv('test_in.csv', index=False)

# If you want to run this from a terminal, uncomment the next line and indent the rest of the code
# if __name__ == '__main__':

# To test, limit Dask to 1 GB
client = Client(n_workers=1, memory_limit='1GB')
df = dd.read_csv('test_in.csv', blocksize='16MB')
results = df.drop_duplicates()
results.to_csv('test_out.csv', index=False)
client.close()
Klaifer Garcia
  • Are you using `Jupyter Lab`/`notebook` or `Google Colab`? In `Google Colab` you get around 12 GB of RAM for free, so maybe the execution will not fail with `"KilledWorker"`. – M_x Oct 16 '20 at 12:13

2 Answers

from dask.distributed import Client
import dask.dataframe as dd


client = Client(memory_limit='8GB')
data = dd.read_csv("path_to_file", dtype={'id': 'Int64'}, sample=1000)
results = data.drop_duplicates(subset=['text'])  # Don't call compute here
results.to_csv("pathout", index=False)  # Write operations automatically call compute

.compute() returns a pandas DataFrame, and from that point on Dask is no longer involved. Use Dask's .to_csv() function instead, and it will save one file per partition.

Just remove the .compute() call and it will work, as long as every individual partition fits into memory.
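
For the output path, a rough sketch of two ways the write could be spelled (the file names here are placeholders, and the single_file flag assumes a reasonably recent Dask version):

# One output file per partition; Dask replaces '*' with the partition number.
results.to_csv("pathout-*.csv", index=False)

# Or, if a single CSV is required, let Dask append the partitions one after
# another into one file.
results.to_csv("pathout.csv", index=False, single_file=True)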

Oh, and you need to assign the result of .drop_duplicates().

JulianWgs
  • I tried this change, assigning the result and removing the compute(): `results = data.drop_duplicates(subset=['text'])` followed by `results.to_csv("pathout", index=False)`. Looking at the client's dashboard, I think everything worked, since the read_csv, drop_duplicates and _write_csv tasks appeared. But the process still ends with KilledWorker – Klaifer Garcia Oct 16 '20 at 13:29
  • One partition/chunk is larger than memory. Lower the size of a partition in the read_csv function. Look in the Dask docs for that. Also lower the memory_limit. – JulianWgs Oct 16 '20 at 13:43
  • I tried different values for blocksize in read_csv (32MB, 16MB, 8MB, 1MB) and the error is always the same. But I noticed something else. Looking at the execution graph, there are several levels of tasks: the first reads the data, then comes drop-duplicates-chunk and then drop-duplicates-combine. The drop-duplicates-combine tasks are bigger; it seems they accumulate all the data from the chunks. Perhaps the memory limit is being exceeded in these combine steps. – Klaifer Garcia Oct 18 '20 at 15:43
  • Could you post the execution graph here? Try values more like 1GB. Did you also lower the memory limit? Also you might want to use a Dask performance report to better show what is going on (https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports); see the sketch after these comments. If you really want me/us to help you, please write a self-contained minimum working example (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports). You could, for example, create a csv file with random data. – JulianWgs Oct 18 '20 at 15:54
  • I've added a code example to my answer. Could you try to execute that? – JulianWgs Oct 18 '20 at 16:02
  • My code was very similar to this. I have included code to reproduce the error. – Klaifer Garcia Oct 20 '20 at 11:37
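
Regarding the performance report suggested above, a minimal sketch of how one could be generated (the report filename and the wrapped computation are illustrative, reusing the self-contained example from the question):

from dask.distributed import Client, performance_report
import dask.dataframe as dd

client = Client(n_workers=1, memory_limit='1GB')
df = dd.read_csv('test_in.csv', blocksize='16MB')

# Everything computed inside this context manager is profiled and written
# out as an interactive HTML report.
with performance_report(filename="dask-report.html"):
    df.drop_duplicates().to_csv('test_out-*.csv', index=False)

client.close()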

I think your worker is killed because drop_duplicates resets df.npartitions to 1. Try printing df.npartitions before and after to make sure.

One thing you can try is results = df.drop_duplicates(split_out=df.npartitions). This will still take a long time to compute, though.
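
Put together with the self-contained example from the question, that would look roughly like the sketch below (the blocksize and the output pattern are just illustrative values):

from dask.distributed import Client
import dask.dataframe as dd

client = Client(n_workers=1, memory_limit='1GB')
df = dd.read_csv('test_in.csv', blocksize='16MB')

# Keep the deduplicated result spread over as many partitions as the input
# instead of collapsing it into a single output partition.
results = df.drop_duplicates(split_out=df.npartitions)

# Writing triggers the computation; one file is written per partition.
results.to_csv('test_out-*.csv', index=False)
client.close()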

  • I don't think that's it. I included the df.npartitions prints as you said, but this number didn't change during execution. Even after KilledWorker it didn't change. I also tried to include the split_out parameter in drop_duplicates, but it didn't work. The error is the same. – Klaifer Garcia Aug 19 '21 at 19:47