
I have a table with around 70000 rows, and I am trying to load the entire table in Python using the psycopg2 module.

When I use this code, it takes around 110 seconds just to load the table.

with conn.cursor() as cursor:
    cursor.execute("""SELECT * FROM Table;""")
    rows = cursor.fetchall()
    print(rows)

I want to ask what the bottleneck is in the above Python code. Is it reading the table, or converting the data into Python dictionaries and tuples? And is there any way to improve the runtime of the Python code?
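
For reference, here is a minimal sketch of how the fetch and the printing could be timed separately (this assumes the same `conn` connection object as in the snippet above):

import time

with conn.cursor() as cursor:
    start = time.perf_counter()
    cursor.execute("""SELECT * FROM Table;""")
    rows = cursor.fetchall()    # network transfer + conversion into Python tuples
    fetched = time.perf_counter()
    print(rows)                 # formatting and writing everything to stdout
    done = time.perf_counter()

print(f"fetchall: {fetched - start:.2f}s, print: {done - fetched:.2f}s")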

Now, when I use the command below, it is able to load the entire table and write it to a file almost instantly, which indicates that loading is not the problem.

COPY (SELECT * FROM Table) TO 'file.txt';

Why is there so much runtime difference between these two cases?
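
If the goal were only to get the table into a file from Python, the same COPY can also be driven through psycopg2's `copy_expert`; a rough sketch (the output path here is just an example):

with conn.cursor() as cursor, open("file.txt", "w") as f:
    cursor.copy_expert("COPY (SELECT * FROM Table) TO STDOUT", f)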

The table consists of an id column and a dictionary stored as JSONB.

Postgres 14.7

OS: Ubuntu 22.04 WSL (Everything running within the WSL)

Edit: This command from the terminal also takes only 0.83 seconds.

bin/psql -d db -c "SELECT * FROM Table;" -o output_file.txt
abhishek2046
  • Is the `psycopg2` code being run from a machine remote to the server? – Adrian Klaver Jul 17 '23 at 18:35
  • How much time does the request take when run directly on the server? – Gugu72 Jul 17 '23 at 18:36
  • Why do you want to fetch all rows from the table? This sounds wrong, and I recommend adding a WHERE clause unless there is an extremely good reason not to restrict the result. Otherwise, your query will be even slower if the table has further rows. 70000 rows are not much, but maybe some day the number of rows will have some more zeros at the end. – Jonas Metzler Jul 17 '23 at 18:36
  • @AdrianKlaver Everything is local – abhishek2046 Jul 17 '23 at 18:40
  • I just ran a test retrieving ~90000 rows and it took 705 ms. Even running the `print(rows)` only increased the time to 4.46s. – Adrian Klaver Jul 17 '23 at 18:41
  • This question is going to need more information, **add as text update**: 1) The table definition. 2) More detail on what `Ubuntu 22.04 WSL` means? Is that where the Postgres server is running, the Python client code or both? 3) Where and how are you measuring the time? – Adrian Klaver Jul 17 '23 at 18:49
  • How fast is it if you replace `print(rows)` with `print(rows.count)` (or whatever the equivalent is)? I think you've established that the issue is not the database or reading data out of it; it's getting it into whatever structure Python has. Are you certain you need to load all the rows locally? What are you doing with it? It might be possible to use native database functions to do what you need. Alternatively, this truly might be a case where the processing can't be done in the db, it can only be done in Python. – Nick.Mc Jul 17 '23 at 23:02

1 Answer


I believe you can use a server-side cursor and iterate over it to improve your performance. So, you should use the following commands:

cursor = conn.cursor('cursor_unique_name')
cursor.execute("""SELECT * FROM Table;""")
for row in cursor:
    print "%s\n" % (row)

When you specify a name for the cursor, psycopg2 creates a server-side cursor, which prevents all of the records from being downloaded at once from the server. Read more about it here.
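
For what it's worth, a Python 3 sketch of the same idea, using a named (server-side) cursor and an explicit `itersize` (the cursor name and batch size here are arbitrary):

with conn.cursor(name="table_scan") as cursor:  # a named cursor is server-side
    cursor.itersize = 5000                      # rows fetched per network round trip
    cursor.execute("""SELECT * FROM Table;""")
    for row in cursor:
        print(row)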

Wendel
  • 1) What is this `print "%s\n" % (row)` supposed to be doing? `print` quit being a statement in Python 3. Why not `for row in cur: print(row)`? 2) For the relatively small number of rows involved this is not going to make much difference. – Adrian Klaver Jul 18 '23 at 15:13