I would like to store the stock prices of a large number of companies in a Parquet file as a time series.
If I gather the data at the end of 1 Jul, I would write a file such as:
1 Jul 2020, Company1,35
1 Jul 2020, Company2,46
....
On 2 Jul, I would receive the new prices and write them in "append" mode as:
2 Jul 2020, Company1,37
2 Jul 2020, Company2,43
...
This will result in two part files being created under the same Parquet dataset directory:
stocks.parquet/
    part0_stocks.parquet    (written on 1 Jul)
    part1_stocks.parquet    (written on 2 Jul)
If this continues for years, I will end up with a large number of part files, one per day. A client application that wants to fetch the time series for 6 months would then have to open more than a hundred small files to gather the data, which may be inefficient.
Is there a better way to store time series data in Parquet?