
I have a text file which is way bigger than my memory. I want to sort the lines of that file lexicographically. I know how to do it manually:

  1. Split into chunks which fit into memory
  2. Sort the chunks
  3. Merge the chunks
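
For reference, a minimal standard-library sketch of those three steps might look like this (chunk_lines and the temporary-file names are arbitrary choices for illustration):

import heapq
import itertools
import os
from contextlib import ExitStack

def external_sort(src, dst, chunk_lines=1_000_000):
    chunk_files = []
    with open(src) as fp:
        while True:
            chunk = list(itertools.islice(fp, chunk_lines))  # 1. split into chunks
            if not chunk:
                break
            chunk.sort()                                      # 2. sort each chunk
            tmp = f"{dst}.chunk{len(chunk_files)}"
            with open(tmp, "w") as out:
                out.writelines(chunk)
            chunk_files.append(tmp)
    with ExitStack() as stack, open(dst, "w") as out:
        readers = [stack.enter_context(open(f)) for f in chunk_files]
        out.writelines(heapq.merge(*readers))                 # 3. merge the chunks
    for f in chunk_files:
        os.remove(f)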

I wanted to do it with Dask; I thought handling data that does not fit in memory was exactly one of Dask's use cases. How can I sort the whole file with Dask?

My Try

You can execute generate_numbers.py -n 550_000_000, which takes about 30 minutes and generates a 20 GB file.
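
The generator script itself is not reproduced here; a hypothetical stand-in that produces a comparable file (one integer per line) could look like this, where the value range and the hard-coded output filename are guesses rather than the original script:

# Hypothetical stand-in for generate_numbers.py (not the original script):
# write n random integers, one per line.
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument("-n", type=int, required=True)
args = parser.parse_args()

with open("numbers-large.txt", "w") as fp:
    for _ in range(args.n):
        fp.write(f"{random.randint(0, 10**18)}\n")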

import dask.dataframe as dd

filename = "numbers-large.txt"

print("Create ddf")
ddf = dd.read_csv(filename, sep = ',', header = None).set_index(0)

print("Compute ddf and sort")
df = ddf.compute().sort_values(0)

print("Write")
with open("numbers-large-sorted-dask.txt", "w") as fp:
    for number in df.index.to_list():
        fp.write(f"{number}\n")

When I execute this, I get

Create ddf
Compute ddf and sort
[2]    2437 killed     python dask-sort.py

I guess the process is killed because it consumes too much memory?

Martin Thoma

1 Answer


Try the following code:

import dask
import dask.dataframe as dd

inpFn = "numbers-large.txt"
outFn = "numbers-large-sorted-dask.txt"
blkSize = 500   # For test on a small file - increase it

print("Create ddf")
ddf = dd.read_csv(inpFn, header = None, blocksize=blkSize)

print("Sort")
ddf_sorted = ddf.set_index(0)

print("Write")
fut = ddf_sorted.to_csv(outFn, compute=False, single_file=True, header=None)
dask.compute(fut)
print("Stop")

Note that I set the blkSize parameter so low only for testing on a small file. For the target data, either increase its value or drop blocksize=blkSize altogether to accept the default.
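
For example, blocksize also accepts a human-readable string; the "64MB" below is just an illustrative value, not a recommendation:

import dask.dataframe as dd

# Let Dask choose the partition size ...
ddf = dd.read_csv("numbers-large.txt", header=None)

# ... or request roughly 64 MB per partition (illustrative value)
ddf = dd.read_csv("numbers-large.txt", header=None, blocksize="64MB")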

Since set_index already performs the sort, there is no need to call sort_values(); another detail is that Dask does not support this method anyway.

As far as writing is concerned, I noticed that you want to generate a single output file, instead of a sequence of files (one file for each partition), so I passed single_file=True.

I also added header=None to suppress writing the column name, which in this case would be a not very meaningful 0.

The last detail to mention is compute=False, so that Dask only builds delayed objects for the write instead of executing it immediately.

All operations so far have only constructed the computation graph, without executing it. Only the final dask.compute(...) call runs the whole graph.

Edit

Your code probably failed due to:

df = ddf.compute().sort_values(0)

Note that you:

  • first call compute(), which materializes the whole result as an in-memory pandas DataFrame,
  • and only then, at the pandas level, attempt to sort it.

The problem is most likely that your computer's memory is not big enough to hold the whole result of compute(), so the process was killed at that point, before it ever had a chance to sort the DataFrame.

Valdi_Bo
  • I will later try it with a smaller file, but with the 20GB file I interrupted the execution after more than 5 hours. With Bash (split + sort) it takes ~24 minutes to sort that file. – Martin Thoma May 22 '20 at 19:17
  • Check also this solution: https://stackoverflow.com/questions/46971219/dask-set-index-from-large-unordered-csv-file – Valdi_Bo May 24 '20 at 04:57