0

I am running a program which needs to query a lot of single rows from a database. I have a list of (ticker,timestamp) pairs (in a pandas DataFrame) and I need to retrieve their corresponding price from a database. Right now, I am using pandas' .apply() to run a query against the database for every row in the pandas DataFrame. That is sequential. I find that that is too slow when I have too much data and was wondering how to improve the performance.

I see two methods, one that I think I prefer but I do not understand well.

In one case, I can make one session.query(T).filter(...) where I use

or_( 
    and_(
        T.ticker == tickername[0],
        T.timestamp == timestamp[0]
    ),
    and_(
        T.ticker == tickername[1],
        T.timestamp == timestamp[1]
    ),
 .... 
)

In the other case (that is the one I might prefer if there is an existing method of doing it), I would design one

q = session.query(T).filter(
    and_(
        T.ticker= tickername,
        T.timestamp == timestamp
    )
) 

for each (ticker,timestamp) pair, and I run all the q's in parallel against the database (postgres).

I've tried the first method and it seems like there is a bottleneck somewhere that causes things to hang.

  • htop shows that the process takes up 6.7% of the memory on the machine.
  • sqlalchemy might be bottlenecking, but I don't know how to check that.
  • postgres is using a lot of CPU% (htop shows 100% or under 100%), so maybe it is the one bottlenecking

I don't know of a better way to make a bulk query like this.

Ilja Everilä
  • 50,538
  • 7
  • 126
  • 127
Slackware
  • 960
  • 1
  • 13
  • 29
  • have you considered uploading pandas dataframe to database and doing a table join? – jimmu Jan 25 '18 at 00:02
  • Use `zip(tickername, timestamp)` to generate an iterable of `(ticker, timestamp)` tuples and follow the answer from the linked Q/A (`tuple_(T.ticker, T.timestamp).in_(pairs)`), if you don't have a really large amount of pairs. If you do, follow @Krishna's advice. – Ilja Everilä Jan 25 '18 at 06:27
  • Will those pairs select most of the table's data, and if not, do you have a suitable index? If you're asking SQL performance questions, include the related table schema. – Ilja Everilä Jan 25 '18 at 06:33

0 Answers0