
I'm working with big CSV files and I need to compute a Cartesian product (merge operation). I've tried to tackle the problem with Pandas (you can check the Pandas code and a data format example for the same problem here) without success due to memory errors. Now I'm trying with Dask, which is supposed to handle huge datasets even when their size is bigger than the available RAM.

First of all, I read both CSVs:

from dask import dataframe as dd

BLOCKSIZE = 64000000  # = 64 MB chunks


df1_file_path = './mRNA_TCGA_breast.csv'
df2_file_path = './miRNA_TCGA_breast.csv'

# Gets Dataframes
df1 = dd.read_csv(
    df1_file_path,
    delimiter='\t',
    blocksize=BLOCKSIZE
)
first_column = df1.columns.values[0]
df1 = df1.set_index(first_column)
df2 = dd.read_csv(
    df2_file_path,
    delimiter='\t',
    blocksize=BLOCKSIZE
)
first_column = df2.columns.values[0]
df2 = df2.set_index(first_column)

# Filter common columns
common_columns = df1.columns.intersection(df2.columns)
df1 = df1[common_columns]
df2 = df2[common_columns]

Then, I perform the operation, storing the result on disk to prevent memory errors:

# Computes a Cartesian product
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1

# Neither of these two options works
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_hdf('/tmp/merge.*.hdf', key='/merge_data')
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_parquet('/tmp/')
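
To give a sense of the scale involved, here is a quick sanity check of the sizes before the cross join (just a sketch, not specific to these files; note that len() on a Dask dataframe triggers a computation):

# Rough sanity check before the cross join
print(df1.npartitions, df2.npartitions)  # partitions on each side
n1, n2 = len(df1), len(df2)              # forces a pass over the CSVs
print('expected rows in the product:', n1 * n2)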

I've made a repo so you can try with exactly the same CSV files that I'm using. I've tried smaller blocksize values but got the same error. Am I missing something? Any kind of help would be really appreciated.

Genarito
  • Can you share the plain Pandas code, too? Do you have an example of what the output should look like? – AMC Jan 22 '20 at 23:22
  • Thank you for your interest! Sure I can! I've edited the question to add a link to another Stack Overflow question where you can see the code and a data format example – Genarito Jan 23 '20 at 13:20

1 Answer


I successfully ran your code using the following method, with memory limited to 32 GB.

I got rid of the BLOCKSIZE argument and used repartition on df1 and df2 instead.

df1 = df1.repartition(npartitions=50)
df2 = df2.repartition(npartitions=1)

Notice that df2 is much smaller than df1 (2.5 MB vs. 23.75 MB), which is why I kept only one partition for df2 and cut df1 into 50 partitions.

Doing so should make the code work for you. For me, the memory used stayed below 12 GB.

To check, I computed the length of the result:

len(df) # 3001995

Following the above creates a Parquet dataset with 50 partitions. You can use repartition again to get the partition count you want.
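
For reference, here is a condensed sketch of the whole pipeline with these changes (file paths and the output directory are the ones from the question, the partition counts are the ones I used and may need tuning, to_parquet needs pyarrow or fastparquet installed, and I've left out the set_index calls for brevity):

from dask import dataframe as dd

# Read both CSVs without a blocksize and repartition afterwards instead
df1 = dd.read_csv('./mRNA_TCGA_breast.csv', delimiter='\t')
df2 = dd.read_csv('./miRNA_TCGA_breast.csv', delimiter='\t')

# Keep only the common columns, as in the question
common_columns = df1.columns.intersection(df2.columns)
df1 = df1[common_columns].repartition(npartitions=50)
df2 = df2[common_columns].repartition(npartitions=1)

# Cartesian product via a constant join key, written out partition by partition
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1
df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_parquet('/tmp/')  # output directory as in the question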

NB:

Adding this should speed up your code:

from dask.distributed import Client
client = Client()

In my case, I had to use the argument Client(processes=False) because of my running environment.
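
If RAM is tight, you can also be explicit about the number of workers and the per-worker memory limit when creating the client; a minimal sketch (the values are illustrative assumptions, not tuned to your data):

from dask.distributed import Client

# Two single-threaded workers capped at 3GB each (illustrative values);
# add processes=False if your environment requires it, as mine did
client = Client(n_workers=2, threads_per_worker=1, memory_limit='3GB')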

DavidK
  • Thank you so much!! I'll try it and tell you – Genarito Jan 25 '20 at 01:24
  • You're welcome. Maybe you can try with less than 50. I've tried with 100 at first. – DavidK Jan 25 '20 at 01:35
  • Sure! As it could be run several times in parallel, I'll probably select a repartition size that consumes less memory to avoid errors. Thank you again! – Genarito Jan 25 '20 at 01:50
  • Hi! It works! Unfortunately it's extremely slow, as I have to set a higher number of partitions (I have only 8 GB of RAM). I'll try Cython and Numba as proposed in [Pandas' docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) and see what happens – Genarito Jan 27 '20 at 14:16
  • I don't know if it is a possibility for you, but you can use more memory than 8GB with https://colab.research.google.com – DavidK Jan 27 '20 at 20:12
  • Sure! But my code has to run on a server with that memory amount :( – Genarito Jan 27 '20 at 20:17
  • Could anyone explain why Dask needs 28GB memory to merge 2 ~20MB files? – ionox0 Jul 07 '22 at 16:13