I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.
X Y
0 1 Red
1 5 Blue
2 8 Green
3 12 Purple
4 15 Blue
5 17 Purple
I am interested in querying the last value of column Y given that X is less than some amount in the most efficient way possible using python.
I am guaranteed that column X is sorted in ascending order.
As an example, given that X is less than 11, I would expect a Y value of "Green".
I have tried the following:
columns='Y'
filters=[('X','<',11]
pd.read_parquet('my_data.parquet',filters=filters,columns=columns).tail(1)
The code above "works" but I am hoping optimizations are possible as this query is run 1M+ times per day.
The parquet file is too large to be read into memory.
I cannot put a starting value for column "X", as there is no guarantee of the size of the gap between values of X. For example, if I were to require "X > 10 and X < 11" I would not get a value for Y returned.
I was hoping given the fact the data is sorted there is a way to optimize this.
I am open to using DuckDB or some other library to do this.