I have a text file that is much bigger than my memory, and I want to sort the lines of that file lexicographically. I know how to do it manually:
- Split into chunks which fit into memory
- Sort the chunks
- Merge the sorted chunks (see the sketch below)
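For reference, a minimal sketch of this manual approach (the function name and chunk size are arbitrary):

```python
import heapq
import os
from contextlib import ExitStack
from itertools import islice

def external_sort(src: str, dst: str, lines_per_chunk: int = 10_000_000) -> None:
    """Sort the lines of `src` lexicographically into `dst` via sorted chunk files."""
    chunk_paths = []
    with open(src) as fp:
        while True:
            chunk = list(islice(fp, lines_per_chunk))  # read one memory-sized chunk
            if not chunk:
                break
            chunk.sort()  # sort this chunk in memory
            path = f"{src}.chunk{len(chunk_paths)}"
            with open(path, "w") as out:
                out.writelines(chunk)
            chunk_paths.append(path)
    # heapq.merge streams the merge, holding only one line per chunk in memory.
    with ExitStack() as stack, open(dst, "w") as out:
        chunks = [stack.enter_context(open(p)) for p in chunk_paths]
        out.writelines(heapq.merge(*chunks))
    for p in chunk_paths:
        os.remove(p)
```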
I wanted to do it with Dask instead; I thought handling data that doesn't fit into memory would be exactly one of Dask's use cases. How can I sort the whole file with Dask?
My Try
You can execute `generate_numbers.py -n 550_000_000`, which takes about 30 minutes and generates a 20 GB file.
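For reproducibility, a generator along these lines would produce a comparable file (a simplified sketch; the actual `generate_numbers.py` may differ):

```python
# Hypothetical sketch of generate_numbers.py; writes n random integers, one per line.
import argparse
import random

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", type=int, required=True, help="number of lines to write")
    args = parser.parse_args()
    with open("numbers-large.txt", "w") as fp:
        for _ in range(args.n):
            fp.write(f"{random.randint(0, 2**63 - 1)}\n")

if __name__ == "__main__":
    main()
```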
```python
import dask.dataframe as dd

filename = "numbers-large.txt"

print("Create ddf")
ddf = dd.read_csv(filename, sep=",", header=None).set_index(0)

print("Compute ddf and sort")
df = ddf.compute().sort_values(0)

print("Write")
with open("numbers-large-sorted-dask.txt", "w") as fp:
    for number in df.index.to_list():
        fp.write(f"{number}\n")
```
When I execute this, I get:
```
Create ddf
Compute ddf and sort
[2] 2437 killed python dask-sort.py
```
I guess the process is killed because it consumes too much memory?
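To check whether that's true, I could watch the script's resident memory while it runs, e.g. with psutil (a hypothetical snippet, not part of the script above):

```python
# Hypothetical check: print resident memory at interesting points
# to confirm the OOM suspicion. Requires `pip install psutil`.
import os
import psutil

def print_memory_usage(label: str) -> None:
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"{label}: {rss / 1024**3:.1f} GiB resident")

# e.g. call print_memory_usage("after compute") right after ddf.compute()
```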