
I'm doing something along the lines of:

import sqlalchemy

conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string, execution_options={'autocommit': True},
                                       encoding='utf-8', isolation_level="AUTOCOMMIT")
rows = conn.execute(sql_query)

to run queries on a Redshift cluster. Lately, I've been doing maintenance tasks such as running VACUUM REINDEX on large tables that get truncated and reloaded every day.
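The maintenance call itself goes through the same engine, roughly like this (the table name is just a placeholder for one of the big tables):

# VACUUM cannot run inside a transaction block, which is why the engine
# above is created with isolation_level="AUTOCOMMIT".
conn.execute("VACUUM REINDEX my_big_table;")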

The problem is that the command above takes around 7 minutes for a particular table (the table is huge: 60 million rows across 15 columns), and when I run it using the method above it never finishes and just hangs. I can see in the cluster dashboard in AWS that parts of the vacuum command run for about 5 minutes and then it simply stops. No Python errors, no errors on the cluster, nothing at all.

My guess is that the connection is lost during the command. So, how do I prove my theory? Has anybody else had this issue? What do I change in the connection string to keep it alive longer?
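One check I can think of is to open a second connection while the first one hangs and look at Redshift's stv_sessions system table; if the session that issued the VACUUM has disappeared while Python is still waiting, the connection was presumably dropped. A rough sketch:

# Separate engine just for monitoring, so it is independent of the hung call.
check_conn = sqlalchemy.engine.create_engine(conn_string)
for row in check_conn.execute("SELECT process, user_name, db_name, starttime FROM stv_sessions;"):
    print(row)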

EDIT:

I changed my connection to this after the comments here:

conn = sqlalchemy.engine.create_engine(conn_string,
                                       execution_options={'autocommit': True},
                                       encoding='utf-8',
                                       connect_args={"keepalives": 1,
                                                     "keepalives_idle": 60,
                                                     "keepalives_interval": 60},
                                       isolation_level="AUTOCOMMIT")

And it had been working for a while. However, the same behaviour has now started again for even larger tables, for which the vacuum reindex actually takes around 45 minutes (at least that is my estimate; the command never finishes running in Python).

How can I make this work regardless of the query runtime?
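(For reference, libpq also accepts a keepalives_count parameter alongside the ones above; I haven't verified that tuning it helps with Redshift, and the values below are purely illustrative.)

conn = sqlalchemy.engine.create_engine(conn_string,
                                       execution_options={'autocommit': True},
                                       encoding='utf-8',
                                       connect_args={"keepalives": 1,
                                                     "keepalives_idle": 30,
                                                     "keepalives_interval": 10,
                                                     "keepalives_count": 30},
                                       isolation_level="AUTOCOMMIT")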

rodrigocf
    In my experience, a connection drop like that (if it indeed is a connection drop) is due to a NAT silently dropping a connection somewhere in the middle. Incidentally, the timeout on AWS NAT gateways is 5 minutes. The tricky thing here is, psycopg2/libpq apparently has TCP keepalive turned on by default, which *should* fix the issue. You can try setting it explicitly: `create_engine(connect_args={"keepalives": 1, "keepalives_idle": 60, "keepalives_interval": 60})`; it is possible that the default keepalive interval is too long. – univerio Apr 24 '17 at 23:05
  • @univerio those keepalive arguments made this work today with no issues at all. If it works tomorrow I'll comment here again to ask you to post this as an actual answer. This just makes me wonder why, if it is turned on by default, I still have to add it manually... – rodrigocf May 11 '17 at 15:53
  • @rodrigocf Does the solution work consistently? I'm facing the same problem.. – rdeboo May 30 '17 at 11:49
  • @rdeboo Yes. Since I applied it, it hasn't failed. – rodrigocf May 30 '17 at 15:23
  • I am facing the same issue using Airflow with Redshift. @univerio's comment solved it and should be the accepted answer here – Chengzhi Jul 28 '17 at 02:38

1 Answer


It's most likely not a connection drop issue. To confirm this, try pushing a few million rows into a dummy table (something which takes more than 5 minutes) and see if the statement fails. Once a query has been submitted to Redshift, it keeps executing in the background regardless of your connection shutting down.
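Something along these lines would do as a test (table names are placeholders; pick whatever takes well over 5 minutes on your cluster):

# Seed a throwaway table from an existing big table, then keep doubling it;
# each INSERT ... SELECT gets slower, so it is easy to cross the 5-minute mark.
conn.execute("CREATE TABLE dummy_test AS SELECT * FROM some_large_table;")
for _ in range(3):  # repeat until a single statement runs longer than the vacuum did
    conn.execute("INSERT INTO dummy_test SELECT * FROM dummy_test;")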

Now, coming to the problem itself: my guess is that you are running out of memory or disk space. Can you elaborate and list your Redshift setup (how many dc1/ds2 nodes)? Also, try running some admin queries and see how much disk space you have left. Sometimes, when the cluster is loaded to the brim, a disk-full error is thrown, but in your case the connection might be dropped well before the error is ever reported to your Python shell.
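For the disk check, the stv_partitions system table (visible to superusers) is enough, for example, using the same conn from your question:

# used and capacity are in 1 MB blocks; this gives overall disk usage in percent.
pct = conn.execute("SELECT SUM(used)::float / SUM(capacity) * 100 FROM stv_partitions;").scalar()
print("disk used: %.1f%%" % pct)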

Satyadev
  • Memory or disk space is actually not an issue: I have a 4-node ds2.8xlarge cluster, and during the command CPU usage doesn't go above 60% and disk usage remains steady at 30%. – rodrigocf Jan 12 '18 at 12:35