
Overview

I'm attempting to improve the performance of our database queries under SQLAlchemy. We're using psycopg2. In our production system, we're choosing to go with Java because it is simply faster by at least 50%, if not closer to 100%. So I am hoping someone in the Stack Overflow community has a way to improve my performance.

I think my next step will be to patch the psycopg2 library to behave like the JDBC driver. If someone has already done this, that would be fine, but I am hoping there's still a settings or refactoring tweak I can make from Python.

Details

I have a simple "SELECT * FROM someLargeDataSetTable" query running. The dataset is GBs in size. A quick performance chart is as follows:

Timing Table

        Records    | JDBC  | SQLAlchemy[1] |  SQLAlchemy[2] |  Psql
-------------------------------------------------------------------- 
         1 (4kB)   | 200ms |         300ms |          250ms |   10ms
        10 (8kB)   | 200ms |         300ms |          250ms |   10ms
       100 (88kB)  | 200ms |         300ms |          250ms |   10ms
     1,000 (600kB) | 300ms |         300ms |          370ms |  100ms
    10,000 (6MB)   | 800ms |         830ms |          730ms |  850ms  
   100,000 (50MB)  |    4s |            5s |           4.6s |     8s
 1,000,000 (510MB) |   30s |           50s |            50s |  1m32s  
10,000,000 (5.1GB) | 4m44s |         7m55s |          6m39s |    n/a
-------------------------------------------------------------------- 
 5,000,000 (2.6GB) | 2m30s |         4m45s |          3m52s | 14m22s
-------------------------------------------------------------------- 
[1] - With the processrow function
[2] - Without the processrow function (direct dump)

I could add more (our data can be as much as terabytes), but I think the changing slope is evident from the data: JDBC just performs significantly better as the dataset size increases. Some notes...

Timing Table Notes:

  • The data sizes are approximate, but they should give you an idea of the amount of data.
  • I'm using the 'time' tool from a Linux bash command line.
  • The times are the wall clock times (i.e. real).
  • I'm using Python 2.6.6 and I'm running with python -u
  • Fetch Size is 10,000
  • I'm not really worried about the Psql timing, it's there just as a reference point. I may not have properly set fetchsize for it.
  • I'm also really not worried about the timing below the fetch size as less than 5 seconds is negligible to my application.
  • Java and Psql appear to take about 1GB of memory resources; Python is more like 100MB (yay!!).
  • I'm using the [cdecimal] library.
  • I noticed a [recent article] discussing something similar to this. It appears that the JDBC driver design is totally different from the psycopg2 design (which I think is rather annoying given the performance difference).
  • My use-case is basically that I have to run a daily process (with approximately 20,000 different steps... multiple queries) over very large datasets, and I have a very specific window of time in which I must finish this process. The Java we use is not simply JDBC; it's a "smart" wrapper on top of the JDBC engine... we don't want to use Java, and we'd like to stop using the "smart" part of it.
  • I'm using one of our production system's boxes (database and backend process) to run the query. So this is our best-case timing. We have QA and Dev boxes that run much slower and the extra query time can become significant.

testSqlAlchemy.py

#!/usr/bin/env python
# testSqlAlchemy.py
import sys
try:
    import cdecimal
    sys.modules["decimal"]=cdecimal
except ImportError,e:
    print >> sys.stderr, "Error: cdecimal didn't load properly."
    raise SystemExit
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def processrow (row,delimiter="|",null="\N"):
    newrow = []
    for x in row:
        if x is None:
            x = null
        newrow.append(str(x))
    return delimiter.join(newrow)

fetchsize = 10000
connectionString = "postgresql+psycopg2://usr:pass@server:port/db"
eng = create_engine(connectionString, server_side_cursors=True)
session = sessionmaker(bind=eng)()

with open("test.sql","r") as queryFD:
   with open("/dev/null","w") as nullDev:
        query = session.execute(queryFD.read())
        cur = query.cursor
        while cur.statusmessage not in ['FETCH 0','CLOSE CURSOR']:
            for row in query.fetchmany(fetchsize):
                print >> nullDev, processrow(row)

After timing, I also ran cProfile, and this is a dump of the worst offenders:

Timing Profile (with processrow)

Fri Mar  4 13:49:45 2011    sqlAlchemy.prof

         415757706 function calls (415756424 primitive calls) in 563.923 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  563.924  563.924 {execfile}
        1   25.151   25.151  563.924  563.924 testSqlAlchemy.py:2()
     1001    0.050    0.000  329.285    0.329 base.py:2679(fetchmany)
     1001    5.503    0.005  314.665    0.314 base.py:2804(_fetchmany_impl)
 10000003    4.328    0.000  307.843    0.000 base.py:2795(_fetchone_impl)
    10011    0.309    0.000  302.743    0.030 base.py:2790(__buffer_rows)
    10011  233.620    0.023  302.425    0.030 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000  145.459    0.000  209.147    0.000 testSqlAlchemy.py:13(processrow)

Timing Profile (without processrow)

Fri Mar  4 14:03:06 2011    sqlAlchemy.prof

         305460312 function calls (305459030 primitive calls) in 536.368 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  536.370  536.370 {execfile}
        1   29.503   29.503  536.369  536.369 testSqlAlchemy.py:2()
     1001    0.066    0.000  333.806    0.333 base.py:2679(fetchmany)
     1001    5.444    0.005  318.462    0.318 base.py:2804(_fetchmany_impl)
 10000003    4.389    0.000  311.647    0.000 base.py:2795(_fetchone_impl)
    10011    0.339    0.000  306.452    0.031 base.py:2790(__buffer_rows)
    10011  235.664    0.024  306.102    0.031 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000   32.904    0.000  172.802    0.000 base.py:2246(__repr__)

Final Comments

Unfortunately, the processrow function needs to stay unless there is a way within SQLAlchemy to specify null = 'userDefinedValueOrString' and delimiter = 'userDefinedValueOrString' for the output. The Java we are currently using already does this, so the comparison (with processrow) needed to be apples to apples. If there is a way to improve the performance of either processrow or SQLAlchemy with pure Python or a settings tweak, I'm very interested.
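
For reference, the kind of pure-Python tweak I have in mind would look something like this (an untested sketch: formatrows is a made-up helper, and it reuses the query, cur, fetchsize and nullDev objects from the script above; the idea is simply to format a whole fetchmany() batch at once and do a single write() per batch instead of one print per row):

# Untested sketch: batch-format each fetchmany() result and write it with
# one write() call per batch to reduce per-row call overhead.
def formatrows(rows, delimiter="|", null="\N"):
    return "\n".join(
        delimiter.join(null if x is None else str(x) for x in row)
        for row in rows)

while cur.statusmessage not in ['FETCH 0', 'CLOSE CURSOR']:
    rows = query.fetchmany(fetchsize)
    if rows:
        nullDev.write(formatrows(rows) + "\n")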

Wayne Koorts
Brian Bruggeman
  • Did you try other ORMs, like Storm or DAL from Web2Py? – philnext Mar 14 '11 at 11:50
  • If you take a look at the timing, a majority of the duration falls under: {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}. That could be the ORM (like SQLAlchemy) calls, but I really think it's the psycopg2 C code / library which wraps around libpq from Postgres. And based on the other thread I linked, my gut tells me that this means it's a psycopg2 problem and the way it uses libpq. – Brian Bruggeman Mar 14 '11 at 13:54
  • Have you considered Jython? With that, you can code in Python but could still utilize the JDBC driver. OTOH, it seems an ETL tool (Informatica or Talend) would be a better solution than Python or Java. – Shamit Verma Mar 16 '11 at 11:14
  • The goal (or experiment at this point...), is to produce a CPython centric equivalent of our current process -- a mix of gnu make, c++, java, ruby, python, perl and bash. We really like CPython because of its ease of use and the access to the c-libraries. Jython is a mixed bag of great and good things (JIT, JDBC, Java libraries), but it is also lagging behind in development (python 3.x, 2.6, 2.7, etc). We do use some of the later language features. We also have a full website alongside our backend processing, and we're considering replacing all Java with CPython. – Brian Bruggeman Mar 16 '11 at 13:18
  • Regarding the memory consumption of the JDBC solution: by default the driver first loads the complete result into memory before you can iterate over it using `next()`. This behaviour can be changed, and then you should see less memory used by the JDBC driver: http://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor – Jun 11 '11 at 08:04
  • The 100M vs 1GB doesn't make a lot of sense. Have you tried to increase the fetchsize? – Samuel Jun 21 '11 at 09:36
  • @Samuel: I think you may need to go back and look at the chart again, but I believe you're referring to 50MB vs 550MB. And my guess is that there is some extra database latency introduced (e.g. a table scan..) due to the size of the request. Each of the lines used 'LIMIT' to grab an exact subset of the total data. My guess is that the bigger LIMIT caused something special to happen in the database which was optimized with a lower LIMIT. – Brian Bruggeman Jun 22 '11 at 14:06
  • @Samuel: I did play around with the fetchsize and I did not see an appreciable difference in performance when I had any increment of 1,000 between 4,000 and 20,000. It appeared to be fairly close to optimal at a fetchsize of 10,000 +- 2,000, but that was really minor. – Brian Bruggeman Jun 22 '11 at 14:10
  • @Brian Bruggeman I was referring to the comment "Java and Psql appear to take about 1GB of memory resources; Python is more like 100MB (yay!!)." I can't imagine why there's that difference, but probably the advice of increasing the fetchsize was a blind shot. Maybe it's a gc issue? Disabling the gc and taking a look may be worth the effort. Other options? Compiling Python and libpq with gprof and/or taking a tcpdump to see if the slowdown is in the client or in the server... – Samuel Jun 22 '11 at 19:27
  • Unfortunately -- unlike JDBC -- psycopg2 doesn't support proper parameterized queries; it always embeds parameter literals into the original SQL query instead. This wastes resources on both the client end (for string formatting) and server (parsing). This could explain *some* of the timing difference you're seeing. There are alternative Postgres drivers, but I don't know whether they are any better. – intgr Nov 02 '11 at 15:21
  • Have you tried benchmarking psycopg2 itself instead of an ORM like SQLAlchemy? – fog Feb 13 '13 at 08:59

4 Answers


This is not an out-of-the-box answer; with all client/db stuff you may need to do some work to determine exactly what is amiss.

Back up postgresql.conf, then change:

log_min_duration_statement = 0
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on                  # Enable capturing of stderr and csvlog
log_directory = 'pg_log'                # directory where log files are written
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern
debug_print_parse = on
debug_print_rewritten = on
debug_print_plan = on
log_min_messages = info                 # (use debug1 for server versions prior to 8.4)

Stop and restart your database server (reload may not pick up the changes). Reproduce your tests, ensuring that the server time and client times match and that you record the start times etc.

Copy the log file off and import it into an editor of your choice (Excel or another spreadsheet can be useful for doing advanced manipulation of the SQL & plans etc.).

Now examine the timings from the server side and note:

  • Is the SQL reported on the server the same in each case?

  • If it is the same, you should see the same timings.

  • Is the client generating a cursor rather than passing SQL?

  • Is one driver doing a lot of casting/converting between character sets, or implicitly converting other types such as dates or timestamps?

and so on.

The plan data will be included for completeness; this may reveal whether there are gross differences in the SQL submitted by the clients.
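
A rough sketch of pulling the statement durations back out of the csvlog from Python, so the server-side numbers can be compared between runs (untested; the log file name is a placeholder, and the position of the free-text message column should be verified against your server version):

# Rough, untested sketch: extract "duration: X ms" entries from a PostgreSQL csvlog.
import csv

durations = []
with open("pg_log/postgresql-2011-03-04_000000.csv", "rb") as logfd:
    for fields in csv.reader(logfd):
        message = fields[13]          # free-text message column (verify the index)
        if message.startswith("duration:"):
            # e.g. "duration: 1234.567 ms  statement: SELECT ..."
            durations.append(float(message.split()[1]))

print "statements logged: %d, total server time: %.1f ms" % (len(durations), sum(durations))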

Paddy Carroll

The stuff below is probably aiming above and beyond what you have in mind or what is deemed acceptable in your environment, but I'll put the option on the table just in case.

  1. Is the destination of every SELECT in your test.sql truly a simple |-separated results file?
  2. Is non-portability (Postgres-specificity) acceptable?
  3. Is your backend Postgres 8.2 or newer?
  4. Will the script run on the same host as the database backend, or would it be acceptable to generate the |-separated results file(s) from within the backend (e.g. to a share?)

If the answer to all of the above questions is yes, then you can transform your SELECT ... statements to COPY ( SELECT ... ) TO E'path-to-results-file' WITH DELIMITER '|' NULL E'\\N'.
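
A minimal sketch of the client-side variant of this, driving COPY ... TO STDOUT through psycopg2's copy_expert so the results land in a file on the client instead of the server (untested; the connection string, output file name and query are placeholders):

# Untested sketch: stream the COPY output to the client with psycopg2's
# copy_expert instead of writing a file on the server side.
import psycopg2

conn = psycopg2.connect("host=server port=5432 dbname=db user=usr password=pass")
cur = conn.cursor()
sql = (r"COPY (SELECT * FROM someLargeDataSetTable) "
       r"TO STDOUT WITH DELIMITER '|' NULL E'\\N'")
with open("results.txt", "w") as outfd:
    cur.copy_expert(sql, outfd)
conn.close()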

vladr
  • We do use copies where they are appropriate. Unfortunately, most of our sql scripts are much more complicated and require database scans. We end up needing to optimize them. So while a copy might work in a very basic 'move', we do quite a bit more than that and need a universal solution. – Brian Bruggeman Jun 12 '11 at 18:52
  • @Brian, just to confirm, you are aware that you can feed arbitrary `SELECT` statements (including `JOIN`s IIRC) to `COPY`, right? – vladr Jun 13 '11 at 03:53
  • Yes. However, COPY is optimized for moving data to and from files, and we use it for that. But that only gets us so far. We need something that can also be used as an ORM. The data I presented above is just straight SQL queries without the ORM overhead. And COPY really does not address my original problem: we want a working python solution. – Brian Bruggeman Jun 13 '11 at 12:37

An alternative could be to use ODBC. This assumes that the Python ODBC driver performs well.

PostgreSQL has ODBC drivers for both Windows and Linux.
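
For example, through pyodbc the fetch loop would look roughly like this (a sketch only; the ODBC driver name and connection details are assumptions, and whether it is actually faster than psycopg2 would need to be measured):

# Untested sketch: the same fetchmany-style loop via pyodbc.
import pyodbc

conn = pyodbc.connect("DRIVER={PostgreSQL Unicode};SERVER=server;PORT=5432;"
                      "DATABASE=db;UID=usr;PWD=pass")
cur = conn.cursor()
cur.execute("SELECT * FROM someLargeDataSetTable")
while True:
    rows = cur.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        pass  # format/write the row here, as in the original script
conn.close()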

Shamit Verma

As someone who programmed mostly in assembler, I see one thing that sticks out as obvious: you are losing time in overhead, and the overhead is what needs to go.

Rather than using Python, which wraps itself in something else that integrates with something that is a C wrapper around the DB... just write the code in C. I mean, how long can it take? Postgres is not hard to interface with (quite the opposite). C is an easy language. The operations you are performing seem pretty straightforward. You can also use SQL embedded in C; it's just a matter of a pre-compile. No need to translate what you were thinking - just write it there along with the C and use the supplied ECPG compiler (read the Postgres manual, chapter 29 IIRC).

Take out as much of the in-betweeny interfacey stuff as you can, cut out the middle man and get talking to the database natively. It seems to me that in trying to make the system simpler you are actually making it more complicated than it needs to be. When things are getting really messy, I usually ask myself the question "What bit of the code am I most afraid to touch?" - that usually points me to what needs changing.

Sorry for babbling on, but maybe a step backward and some fresh air will help ;)

tentimes
  • 2
    I actually come from a background of embedded software, writing in C and assembly is something I did quite often. In this case, however, I had included the Psql line for a reference point of 'pure' c-code. So your proposal to remove 'interfacy' stuff did not make too much sense to me... And our goal within the group is to move completely away from compiled languages since they are limiting. Had we picked Jython, this would be a non issue since Jython would use the same library as Java... which is faster than C with big data (see PSQL line). – Brian Bruggeman Sep 04 '11 at 06:04