I'm connected to a read-only Redshift instance from a data vendor (through JDBC, with no other AWS tools available) that houses a few hundred different tables. I need to derive a few tables from them and, because the Redshift instance is read-only, load the results onto my remote server. (Note: I have limited-time access to the Redshift instance, which is why I need to transfer the derived tables to my remote server.)
The raw data are huge (each table has millions to billions of records), and each derived table involves joining 10-20 raw tables (with conditions, of course). I'm using Python/pandas on my server, and my current solution looks something like this:
import pandas as pd
from sqlalchemy import create_engine

# Connect to Redshift (conn_str defined elsewhere)
engine = create_engine(conn_str)
conn = engine.connect().execution_options(stream_results=True)  # Enable server-side cursor

# Run query on Redshift and save results to HDF in batches
query = "SELECT ... FROM ... INNER JOIN ... LEFT JOIN ..."  # Join multiple tables here
itr = pd.read_sql(query, conn, chunksize=100000)
print('Query executed')

for chunk in itr:
    chunk.to_hdf('test.h5', **kwargs_to_hdf)  # kwargs_to_hdf supplies the HDF key and append/format options
However, the itr object takes forever to generate (without any error), even though I batch on both the server end (stream_results=True) and the client end (chunksize). When I run the identical query in DataGrip on my local machine, it finishes within 20 minutes.
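For what it's worth, I've also considered timing the first batch at the raw DB-API level to see whether the time is spent executing the query on Redshift or fetching/converting rows in pandas, though I'm not sure this isolates the problem correctly. A rough sketch of what I mean (assuming the SQLAlchemy engine sits on top of psycopg2, which its postgresql dialect uses by default; the cursor name and batch size are arbitrary):

import time

# Reuse the same engine/query; a named cursor keeps the result set on the server
raw = engine.raw_connection()
try:
    cur = raw.cursor(name='diag_cursor')   # server-side cursor (psycopg2-specific)
    cur.itersize = 100000                  # rows per network round trip when iterating
    start = time.time()
    cur.execute(query)                     # for a named cursor this returns almost immediately
    first_batch = cur.fetchmany(100000)    # blocks until the first rows actually arrive
    print(f'First {len(first_batch)} rows after {time.time() - start:.1f}s')
finally:
    raw.close()

My understanding is that if the first batch alone already takes hours here, the bottleneck is on the Redshift/query side rather than in pandas or the HDF writing.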
My knowledge is too limited to tell whether something is wrong with this approach, or whether there is a better one given my constraints (read-only Redshift, no other AWS tools like S3, huge data). Any recommendations?