I am new to Dask and having some trouble with it.
I am using a machine (4GB RAM, 2 cores) to analyse two CSV files (key.csv: ~2 million rows, about 300MB; sig.csv: ~12 million rows, about 600MB). Pandas can't fit this data in memory, so I switched to dask.dataframe. What I expected is that Dask would process the data in small chunks that fit in memory (it can be slower, I don't mind at all as long as it works). However, Dask still uses up all of the memory.
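From the documentation, my mental model of what "small chunks" means is roughly the sketch below; this is just something I wrote to check my understanding, not my actual code:

import dask.dataframe as dd

# My understanding: a Dask DataFrame is a collection of smaller pandas
# DataFrames (partitions), and only a few partitions should need to be
# in memory at any one time.
df = dd.read_csv("key.csv")
print(df.npartitions)                        # how many blocks key.csv was split into
first_block = df.get_partition(0).compute()  # one block as a plain pandas DataFrame
print(type(first_block), len(first_block))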
My code is below:
import dask.dataframe as dd

key = dd.read_csv("key.csv")
sig = dd.read_csv("sig.csv")
merge = dd.merge(key, sig, left_on=["tag", "name"],
                 right_on=["key_tag", "query_name"], how="inner")
# write the result straight to disk since it can't fit in memory
merge.to_csv("test2903_*.csv")
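In case it is relevant, I also understand that the block size can be set explicitly when reading; the "64MB" value below is just one I picked for illustration, not something I have verified is sensible:

import dask.dataframe as dd

# Read each file in small blocks so every partition stays well under the RAM limit.
# "64MB" is an assumed value for illustration only.
key = dd.read_csv("key.csv", blocksize="64MB")
sig = dd.read_csv("sig.csv", blocksize="64MB")

# Check how many partitions each file was split into.
print(key.npartitions, sig.npartitions)

My assumption was that smaller blocks would keep the merge within memory, but I have not confirmed whether that assumption is correct.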
Did I make any mistakes? Any help is appreciated.