I would like to store the stock prices of a large number of companies in a Parquet file as a time series.
If I gather the data at the end of 1 Jul, I would write a file such as:

1 Jul 2020, Company1,35  
1 Jul 2020, Company2,46  
....

On 2 Jul, I would receive the new prices and write them in "append" mode as:

2 Jul 2020, Company1,37  
2 Jul 2020, Company2,43  
...  

This will result in two partition files being created under the same Parquet directory:

stocks.parquet/   
    part0_stocks.parquet written on 1 Jul  
    part1_stocks.parquet written on 2 Jul

If this continues for years, I will end up with a large number of partition files, one per day. If a client application wants to fetch the time series for 6 months, it will have to read many files to gather the data, which may be inefficient.
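
To put a rough number on the concern, here is a small sketch that counts how many daily files a six-month query would open. The function `part_files_in_window` is hypothetical, and the weekday-only option is an assumption that ignores market holidays.

```python
import datetime as dt


def part_files_in_window(start, end, weekdays_only=True):
    """Count the daily part files a query over [start, end] would open."""
    count = 0
    day = start
    while day <= end:
        # weekday() < 5 means Monday through Friday.
        if not weekdays_only or day.weekday() < 5:
            count += 1
        day += dt.timedelta(days=1)
    return count


start, end = dt.date(2020, 7, 1), dt.date(2020, 12, 31)
six_months_all_days = part_files_in_window(start, end, weekdays_only=False)
six_months_weekdays = part_files_in_window(start, end)
```

With one file per calendar day, the window from 1 Jul 2020 to 31 Dec 2020 covers 184 files, or 132 if prices are only written on weekdays.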

Is there a better way to store timeseries data in parquet?

Yash