I'm working with big CSV files and I need to compute a Cartesian product (a merge operation). I've tried to tackle the problem with Pandas (you can check the Pandas code and a data format example for the same problem, here) without success, due to memory errors. Now I'm trying with Dask, which is supposed to handle huge datasets even when their size is bigger than the available RAM.
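To make the goal concrete, this is the kind of cross join I'm after, shown on a tiny in-memory Pandas example (the column names and values here are made up purely for illustration):

import pandas as pd

# Two toy frames: every row of `a` should be paired with every row of `b`
a = pd.DataFrame({'gene': ['g1', 'g2'], 'value': [1.0, 2.0]})
b = pd.DataFrame({'gene': ['g3'], 'value': [3.0]})

# Classic cross-join trick: merge on a constant temporary key
a['_tmpkey'] = 1
b['_tmpkey'] = 1
cartesian = a.merge(b, on='_tmpkey').drop('_tmpkey', axis=1)
print(cartesian)  # 2 x 1 = 2 rows, with _x/_y suffixes on the clashing columns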
First of all, I read both CSV files:
from dask import dataframe as dd

BLOCKSIZE = 64000000  # = 64 MB chunks

df1_file_path = './mRNA_TCGA_breast.csv'
df2_file_path = './miRNA_TCGA_breast.csv'

# Read both Dataframes
df1 = dd.read_csv(
    df1_file_path,
    delimiter='\t',
    blocksize=BLOCKSIZE
)
first_column = df1.columns.values[0]
df1 = df1.set_index(first_column)  # set_index() is not in-place, so assign it back

df2 = dd.read_csv(
    df2_file_path,
    delimiter='\t',
    blocksize=BLOCKSIZE
)
first_column = df2.columns.values[0]
df2 = df2.set_index(first_column)

# Keep only the columns that both Dataframes share
common_columns = df1.columns.intersection(df2.columns)
df1 = df1[common_columns]
df2 = df2[common_columns]
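As a rough sanity check on why the result cannot be held in RAM, the Cartesian product has len(df1) * len(df2) rows. This snippet is only illustrative and not part of my pipeline (calling len() on a Dask Dataframe triggers a full row count):

# Rough size estimate of the cross join (illustration only)
n_rows_1 = len(df1)
n_rows_2 = len(df2)
print(f'Expected rows in the Cartesian product: {n_rows_1 * n_rows_2}')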
Then, I perform the operation, storing the result on disk to prevent memory errors:
# Compute the Cartesian product by merging on a constant temporary key
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1

# Neither of these two options works
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_hdf('/tmp/merge.*.hdf', key='/merge_data')
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_parquet('/tmp/')
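Since Dask builds its task graph lazily, the merge expression itself returns immediately and the actual work only happens when the result is written out. Just for reference, this is the shared prefix of the two failing lines above, assigned to a (hypothetical) intermediate variable so it can be inspected:

# Lazy merged Dataframe; nothing is computed at this point
merged = df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1)
print(merged.npartitions)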
I've made a repo so you can try it with exactly the same CSV files I'm using. I've tried smaller blocksize values, but I get the same error. Am I missing something? Any kind of help would be really appreciated.