5

I currently have a problem with my Python script that when I execute a SQL function from cursor.execute() for a specific function, It runs the function, function completes, and then the script hangs for somewhere close to five hours before resuming the rest of the program.

I watch the server status of my function in pgAdminIII as it is called from Python and when the function completes the server just shows an open connection to my python-user. Before it would idle in transaction, but after adding conn.set_isolation_level(0), it only shows the open connection.

I have also put prints surrounding the cursor.execute() but it takes five hours to proceed to the following print statement. I have tried killing the connection and restarting it to no avail. I stripped the function call completely out of my script so it was the only process my script handled. Nothing has worked.

The difficulty is that both sides keep pointing to the other as the source of the problem. I am assuming it is some sort of interaction effect from this specific function. Perhaps a signal is not being sent back to python from postgres after running the specific function? None of the other functions have an issue with the same method of execution.

Ideas? Solutions? -I was considering a timeout, but the runtime can vary anywhere from 20 minutes to 2 hours. -The function it runs makes a table and returns only a string stating 'Complete', all operations are accounted for in the database after it disappears as an active function in server status.

Here is the code I use for replicating the problem, isolated from my main script:

#!/usr/bin/env python

import sys
import os
import smtplib
import ftplib
import gzip
import logging
import shutil
import argparse
import traceback
import subprocess
from email.mime.text import MIMEText

import psycopg2
import psycopg2.extras
import psycopg2.extensions
from psycopg2 import OperationalError

pg_db    = "ppl"
pg_port  = "5432"
pg_user  = "<user>"
pg_pass  = "<pass>"
pg_host  = "<host>"

# connect to database
try:
    conn = psycopg2.connect(database=pg_db,
                            port=pg_port, 
                            user=pg_user, 
                            password=pg_pass, 
                            host=pg_host)
    conn.set_isolation_level(0)
except:
    print "Unable to connect to the database."

cur = conn.cursor()

print 'Starting select <function1>'
cur.execute("select <function1>(999999, null );")
print 'Assign variable...'
a = cur.fetchall()
print 'Done'

print 'Starting select <function2>'
cur.execute("select <function2>(999999, null );")
print 'Assign variable...'
b = cur.fetchall()
print 'Done'

print a
print b

conn.close()
sys.exit()

Then I ran a strace as suggested with these results:

<previous function statements for about 4k lines>

write(1, "Starting select <function>"..., 45) = 45
sendto(3, "Q\0\0\0009select <function>"..., 58, MSG_NOSIGNAL, NULL, 0) = 58
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "T\0\0\0001\0\1<function_name>\0"..., 16384, 0, NULL, NULL) = 500
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0mSINFO\0C00000\0Mv_eval_table "..., 16384, 0, NULL, NULL) = 110
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\4\247SINFO\0C00000\0Mcmd := CREATE"..., 16384, 0, NULL, NULL) = 1192
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\23hSINFO\0C00000\0Mcmd := \t--For"..., 16384, 0, NULL, NULL) = 4969
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 84
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 672
...
brk(0xbe9000)                           = 0xbe9000
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 84
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 168
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 1008
...

<does this for about 600k more lines before executing the rest of the script, I am assuming it is indefinitely looping until a timeout>

Next I ran a cProfile for a function that wrapped the second block of code that called the second function:

Tue Jan 15 16:04:50 2013    function2_prof

         7 function calls in 15755.988 CPU seconds

   Random listing order was used

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {method 'cursor' of 'psycopg2._psycopg.connection' objects}
        1    0.000    0.000 15755.988 15755.988 test_sql_sub2.py:48(LR2)
        1 15755.988 15755.988 15755.988 15755.988 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
        1    0.000    0.000 15755.988 15755.988 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'close' of 'psycopg2._psycopg.cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
General Grievance
  • 4,555
  • 31
  • 31
  • 45
BagoDev
  • 313
  • 6
  • 12
  • Tried creating separate cursors for each function call, still unsuccessful in getting past the hang. – BagoDev Jan 15 '13 at 02:51
  • since it does resume (eventually) rather than hanging forever, you might try sniffing the traffic, attaching strace to python, etc. and watching what actually happens during those five hours – Eevee Jan 15 '13 at 03:08
  • a few recommendations: You should really include version information in your question (OS, Postgresql, Python, etc); People will usually ask for it. Next, you should give the function definitions. Finally, I would try calling your function with parameters. it's just usually a good idea. I'm not sure if any of this will help, but I sure hope so. – David S Jan 15 '13 at 03:49
  • I will try to strace and post my results – BagoDev Jan 15 '13 at 04:00
  • Its python 2.7.3, on unix, postgresql. Also I ran the function manually in postgresql pgAdminIII and it was successful without the hang. – BagoDev Jan 15 '13 at 04:01
  • I created a profiler for the python script, I will post those results when it finishes the 5 hour hang. – BagoDev Jan 15 '13 at 20:38
  • Completed a pstats on the profiler not sure if any new information can be ascertained from the file. – BagoDev Jan 16 '13 at 00:45

2 Answers2

1

I think your function does more than you realize. Look at your strace:

brk(0xbe9000)                           = 0xbe9000
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 84
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 168
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0SSINFO\0C00000\0Mlicense_row_i"..., 16384, 0, NULL, NULL) = 1008

brk means "give me more memory", so clearly Python is accumulating data. poll checks whether there's any data waiting; a positive return value (here, 1) means there is. And recvfrom reads that data and returns the number of bytes read.

So it looks like Postgres is sending you a lot of data, so much that it takes a significant amount of time to receive. I don't know much about Postgres's wire format, but the license_row_i... in every poll sure sounds like you're getting back tons and tons of rows. The function no longer shows as executing because it isn't; the work is done, and Postgres is now just sending the results back.

It's no wonder both sides are blaming the other; the problem is in the middle. :)


edit: Aha, more information implicates RAISE INFO. Psycopg2 does indeed store all of those notices, on the connection object.

It only keeps the last fifty, but if you're sending over half a million notices to the client, it'll take a while to keep turning them into Python strings, throwing away the oldest, appending the newest, etc. If the command-line client is ignoring them entirely (or even just doing this all in C), it'll naturally be much faster.

So the easy solution is "don't do that". :) You can also configure Postgres not to send these to the client, but of course they'd end up useless for debugging. There's no way, that I can see, to ask psycopg2 not to collect notices sent to the client.

Eevee
  • 47,412
  • 11
  • 95
  • 127
  • Well see that is the issue, the only thing the function returns is a single row that has one string that says 'Complete'. I have also run this manually on pgAdminIII and from Bash, both take roughly 20 minutes. I am not able to provide the function, but I can say it basically builds a table from a different tables across different schemas and then finishes by passing the name of the newly created table. So if it is not returning tons and tons of rows what could it be sending back to python? Does it give more than just the return? – BagoDev Jan 16 '13 at 02:08
  • without seeing the function, i have no idea. but you should have some results in `a` and `b`; print them out and see what they are. – Eevee Jan 16 '13 at 02:29
1

The issue was due to a single line in multiple for-loops in the SQL function:

raise info 'license_row_id := %', rec.license_row_id;

These raise info messages were somehow not handled appropriately by python as they were by bash and pgAdmin. My guess is that because they were unhandled, they kept looping and processing and asking for more memory until it was unable to process, finally dropping the raise infos without warning or error and resuming the program.

I commented out this needless raise info line and the whole script ran in 20 minutes, a normal run time for the program. If anyone has an idea how to make python handle raise info messages from postgresql that would probably even further answer this problem.

BagoDev
  • 313
  • 6
  • 12