I am running a program that needs to look up a lot of single rows from a database. I have a list of (ticker, timestamp) pairs in a pandas DataFrame, and for each pair I need to retrieve the corresponding price from the database. Right now I use pandas' .apply() to run one query against the database per row of the DataFrame, which is sequential. This gets too slow once the DataFrame is large, and I was wondering how to improve the performance.
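Roughly, the current sequential version looks like this (sketch only; the mapped class, column names, sample data, and connection string below are illustrative, not my real schema):

    import pandas as pd
    from sqlalchemy import Column, DateTime, Float, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class T(Base):
        # Illustrative mapped class; my real table and columns differ.
        __tablename__ = "prices"
        ticker = Column(String, primary_key=True)
        timestamp = Column(DateTime, primary_key=True)
        price = Column(Float)

    engine = create_engine("postgresql://user:password@localhost/mydb")
    Session = sessionmaker(bind=engine)
    session = Session()

    # One row per (ticker, timestamp) pair I need to look up.
    df = pd.DataFrame({
        "ticker": ["AAPL", "MSFT"],
        "timestamp": pd.to_datetime(["2020-01-02", "2020-01-02"]),
    })

    def lookup_price(row):
        # One round trip to the database per DataFrame row -- the slow, sequential part.
        return (
            session.query(T.price)
            .filter(T.ticker == row["ticker"], T.timestamp == row["timestamp"])
            .scalar()
        )

    df["price"] = df.apply(lookup_price, axis=1)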
I see two methods, one of which I think I prefer but do not understand well.
In one case, I can make one session.query(T).filter(...)
where I use

    or_(
        and_(
            T.ticker == tickername[0],
            T.timestamp == timestamp[0]
        ),
        and_(
            T.ticker == tickername[1],
            T.timestamp == timestamp[1]
        ),
        ...
    )
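Built from the DataFrame, I imagine that filter would look something like this (same illustrative T, session, and df as the sketch above):

    from sqlalchemy import and_, or_

    # One and_() per (ticker, timestamp) pair, combined into a single or_().
    conditions = [
        and_(T.ticker == row.ticker, T.timestamp == row.timestamp)
        for row in df.itertuples(index=False)
    ]

    # A single query that fetches all requested rows in one round trip.
    rows = session.query(T).filter(or_(*conditions)).all()

    # Index the results by (ticker, timestamp) so they can be matched back to the pairs.
    prices = {(r.ticker, r.timestamp): r.price for r in rows}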
In the other case (the one I might prefer, if there is an existing way to do it), I would build one

    q = session.query(T).filter(
        and_(
            T.ticker == tickername,
            T.timestamp == timestamp
        )
    )

for each (ticker, timestamp) pair, and run all the q's in parallel against the database (Postgres).
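For that parallel variant, I was thinking of something like this sketch, using a thread pool and one short-lived session per lookup, with the same illustrative Session factory and T as above (a single Session object is not meant to be shared across threads):

    from concurrent.futures import ThreadPoolExecutor
    from sqlalchemy import and_

    def lookup_one(pair):
        ticker, timestamp = pair
        session = Session()  # separate session per call; one Session is not thread-safe
        try:
            return (
                session.query(T.price)
                .filter(and_(T.ticker == ticker, T.timestamp == timestamp))
                .scalar()
            )
        finally:
            session.close()

    pairs = list(zip(df["ticker"], df["timestamp"]))

    # Run the per-pair queries concurrently against Postgres; results come back in order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        df["price"] = list(pool.map(lookup_one, pairs))

I am not sure how many workers would make sense here, or whether the engine's connection pool size would need to be raised to match.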
I've tried the first method and it seems like there is a bottleneck somewhere that causes things to hang.
- htop shows that the process takes up 6.7% of the memory on the machine.
- SQLAlchemy might be the bottleneck, but I don't know how to check that.
- Postgres is using a lot of CPU (htop shows it at or just under 100%), so maybe it is the bottleneck.
I don't know of a better way to make a bulk query like this.