I am converting some csv files to parquet. To do so, I decided to use dask: read the csv with dask and write it back out to parquet. I am using a large blocksize, as the customer requested (500 MB). The csv files are 15 GB and larger (up to 50 GB), and the machine has 64 GB of RAM. Whenever I run the basic to_parquet command, RAM usage keeps increasing until it is so high that Linux kills the process. Does somebody know why this happens? When I don't specify a blocksize it works, but it creates a lot of small parquet files (about 24 MB each). Is there a way to solve this while creating blocks of at least 500 MB?
import dask.dataframe as dd

_path = 'E:/'
# read the csv with a 500 MB blocksize, then write it back out as gzip-compressed parquet
dt = dd.read_csv(_path + 'temporal.csv', blocksize=500e6)
dt.to_parquet(path=_path + 't.parq', compression='gzip')
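
For reference, the only variant I have considered so far (a sketch only, not verified) is to read with the default blocksize so each task stays small, and then repartition to roughly 500 MB partitions before writing. I am assuming the partition_size argument of repartition does what I think it does, and I don't know whether this actually avoids the memory blow-up:

import dask.dataframe as dd

_path = 'E:/'

# Read with the default (small) blocksize so each read task stays cheap in memory,
# then merge partitions up to roughly 500 MB each before writing parquet.
# NOTE: not verified on files of this size.
dt = dd.read_csv(_path + 'temporal.csv')
dt = dt.repartition(partition_size='500MB')
dt.to_parquet(path=_path + 't.parq', compression='gzip')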