I have a set of CSV files, one per year of data, with a YEAR
column in each. I want to convert them into a single parquet dataset, partitioned by year, for later use in pandas. The problem is that the dataframe with all years combined is too large to fit in memory. Is it possible to write the parquet partitions iteratively, one by one?
I am using fastparquet as the engine.
Here is a simplified code example. This code blows up memory usage and crashes.
import pandas as pd

# Read every year into memory, then write one partitioned dataset in a single call.
df = []
for year in range(2000, 2020):
    df.append(pd.read_csv(f'{year}.csv'))
df = pd.concat(df)
df.to_parquet('all_years.pq', partition_cols=['YEAR'])
I tried writing the years one by one instead, like so.
for year in range(2000, 2020):
    df = pd.read_csv(f'{year}.csv')
    df.to_parquet('all_years.pq', partition_cols=['YEAR'])
The data files are all there in their respective YEAR=XXXX
directories, but when I try to read the resulting dataset back, I only get the last year. Maybe it is possible to fix the parquet metadata after writing the separate partitions?
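For example, something along these lines is what I was hoping for, assuming fastparquet.writer.merge can rebuild a common _metadata file from the individual partition files (the glob pattern is only my guess at the file layout that to_parquet produces):

import glob
import fastparquet

# Assumed layout: one or more part files under each YEAR=XXXX directory.
parts = glob.glob('all_years.pq/YEAR=*/*.parquet')
# Rebuild the dataset-level _metadata so readers see all partitions.
fastparquet.writer.merge(parts)

Would rebuilding the metadata like this be the right approach, or is there a better way to build the dataset year by year?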