
I'm connected to a read-only Redshift instance from a data vendor (through JDBC; no other AWS tools are available) that houses a few hundred different tables. I need to derive a few tables from them and, because the Redshift instance is read-only, load the results onto my remote server. (Note: I have limited-time access to the Redshift instance, which is why I need to transfer the derived tables to the remote server.)

The raw data are huge (each table has millions to billions of records) and each derived table involves joining 10-20 raw tables (with conditions, of course). I'm using Python/pandas on my server and my current solution looks something like this:

import pandas as pd
from sqlalchemy import create_engine

# Connect to Redshift
engine = create_engine(conn_str)
conn = engine.connect().execution_options(stream_results=True)  # Enable server-side cursor

# Run query on Redshift and save results to HDF in batches
query = "SELECT ... FROM ... INNER JOIN ... LEFT JOIN ..."  # Join multiple tables here
itr = pd.read_sql(query, conn, chunksize=100000)
print('Query executed')
for chunk in itr:
    # kwargs_to_hdf holds the HDF key plus append=True / format='table' so chunks accumulate
    chunk.to_hdf('test.h5', **kwargs_to_hdf)

However, the itr object takes forever to generate (with no error), even though I batch on both the server and the client side. When I run the identical query in DataGrip on my local machine, it finishes within 20 minutes.

My knowledge is too limited to tell whether anything is wrong here, or whether there's a better approach given my constraints (e.g., read-only Redshift, no other AWS tools like S3, huge data). Any recommendations?

  • When you say "I run the same query in DataGrip on my local machine", is it running against the same volume of data? Are you _sure_ that you need to download the raw data for your particular purpose? Even if you can extract the raw data, is it being constantly updated so you'd then have to figure out how to extract the 'new' data since your last extract? Can't you just run the desired queries on the Redshift instance to obtain your desired results (as opposed to extracting the raw data)? – John Rotenstein May 14 '21 at 22:56
  • @JohnRotenstein Thanks! I edited the question per your comments – Socap May 14 '21 at 23:07
  • You'd probably need to write queries that extract _portions_ of the data using `ORDER BY ... LIMIT ... OFFSET ...`, looping through OFFSET values (see the sketch after these comments). It would be slow due to the overhead of each query, but you could run several in parallel, each grabbing a separate range. I would recommend connecting via `psycopg2` rather than pandas. – John Rotenstein May 14 '21 at 23:21
  • I may be wrong, but given the number of tables, the number of joins, and the data size, the query is likely misusing Redshift profoundly: it is inappropriate for the sort orders of the tables, so it falls back to hash joins everywhere, and with this volume of data it massively exceeds the memory available for hash joins and induces massive amounts of disk I/O. Redshift can issue timely SQL on Big Data when, and only when, it is operated correctly, which is to say, when queries are appropriate to the sort orders of the tables being queried. –  May 14 '21 at 23:38
  • There is nothing here that I haven't seen run fast on Redshift, so the question is what it is about your situation that causes the slowdown. Are the resulting derived tables large? Pulling large amounts of data out of Redshift without using S3 can be problematic. How fast does the query run if you put the results into a cursor? That will tell us whether the query is slow or extracting the data is slow. – Bill Weiner May 15 '21 at 03:55
  • Anything can run fast, if you pay enough for it. The question is efficiency. If a given technology is inefficient, it will cost you more to obtain the same performance. –  May 15 '21 at 20:45
  • @JohnRotenstein The problem is that each derived table is generated by joining multiple raw tables. I can't predict which portions of the raw tables will match, so splitting by portions might not work for my purpose? – Socap May 16 '21 at 06:07
  • If you use `ORDER BY ... LIMIT ... OFFSET ...` and the data has not been updated in the meantime, then it will work fine for extracting portions of the results. – John Rotenstein May 16 '21 at 12:13
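
For reference, a minimal sketch of the paging approach suggested in the comments, using `psycopg2` (Redshift speaks the PostgreSQL wire protocol, so `psycopg2` can connect with the same host, port, database, and credentials). The connection details, the join query, the sort key (`some_stable_key`), the page size, and the output file are all placeholders rather than values from the question:

import csv
import psycopg2

PAGE_SIZE = 100_000  # rows per request; tune to taste

# Placeholder connection details -- substitute the vendor-supplied host/credentials
conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='vendor_db', user='me', password='...')

base_query = "SELECT ... FROM ... INNER JOIN ... LEFT JOIN ..."  # the same join as above

offset = 0
with conn.cursor() as cur, open('derived_table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    while True:
        # ORDER BY a stable key so consecutive pages neither overlap nor skip rows
        cur.execute(base_query + " ORDER BY some_stable_key LIMIT %s OFFSET %s",
                    (PAGE_SIZE, offset))
        rows = cur.fetchall()
        if not rows:
            break
        writer.writerows(rows)
        offset += PAGE_SIZE

conn.close()

Running several of these loops in parallel, each covering a disjoint OFFSET range, is the parallelisation described above. Note that every request still pays the full cost of the join, so this only helps if the query itself is reasonably fast and the bottleneck is extracting the results.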

0 Answers