2

I have data from 6 months of emails (email properties like send date and subject line, plus recipient details like age, gender etc., altogether around 20 columns) in my Teradata table. It comes to around 20 million rows in total, which I want to bring into Python for further predictive modelling.

I tried to run the selection query using the 'pyodbc' connector, but it just runs for hours and hours. Then I stopped it and modified the query to fetch just 1 month of data (maybe 3-4 million rows), but it still takes a very long time.

Is there any better (faster) option than 'pyodbc', or any different approach altogether?

Any input is appreciated. Thanks.

Sreenath1986
  • Even when using ODBC, exporting 20 million rows shouldn't run for hours, maybe minutes. Do you write the data to a flat file? Then a TPT job should run < 1 minute. Can you check Teradata's QueryLog for this Select? – dnoeth Jun 26 '18 at 06:46
  • You might check that ODBC Tracing and similar debug functions are disabled on the client side, as these can slow things down massively. – TallTed Jun 28 '18 at 19:45

1 Answer

3

When communicating between Python and Teradata, I recommend using the teradata package (pip install teradata; https://developer.teradata.com/tools/reference/teradata-python-module). It leverages ODBC (or REST) to connect.
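A minimal sketch of that approach, assuming ODBC is already configured on your machine; the system name, credentials, table and column names below are placeholders, not details from the question:

    # Connect via the teradata module and hand the DB-API session straight to pandas.
    import teradata
    import pandas as pd

    udaExec = teradata.UdaExec(appName="EmailExtract", version="1.0", logConsole=False)

    with udaExec.connect(method="odbc", system="tdprod",
                         username="my_user", password="my_pass") as session:
        df = pd.read_sql(
            "SELECT send_date, subject_line, recipient_age, recipient_gender "
            "FROM mydb.email_events "
            "WHERE send_date >= DATE '2018-01-01'",
            session)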

Besides this, you could use JDBC via JayDeBeApi. JDBC can sometimes be somewhat faster than ODBC.
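A comparable sketch with JayDeBeApi, assuming you have a local JVM and have downloaded the Teradata JDBC driver jars (terajdbc4.jar, tdgssconfig.jar); host, credentials and jar paths are placeholders:

    # JDBC connection via JayDeBeApi using the Teradata JDBC driver.
    import jaydebeapi
    import pandas as pd

    conn = jaydebeapi.connect(
        "com.teradata.jdbc.TeraDriver",
        "jdbc:teradata://td-host/DATABASE=mydb,CHARSET=UTF8",
        ["my_user", "my_pass"],
        jars=["/path/to/terajdbc4.jar", "/path/to/tdgssconfig.jar"])
    try:
        # This connection is DB-API compliant as well, so pandas can use it directly.
        df = pd.read_sql("SELECT TOP 10 * FROM mydb.email_events", conn)
    finally:
        conn.close()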

Both options support the Python Database API Specification, so the rest of your code doesn't have to be touched. E.g. pandas.read_sql works fine with connections from either of the above.

Your performance issues, however, look like they come from something else:

  1. network connectivity

  2. Python (Pandas) memory handling

ad 1) throughput can only be replaced with more throughput

ad 2) you could try to do as much as possible within the database (feature engineering), and your local machine should have enough RAM ("pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset"). Maybe Apache Arrow can relieve some of your local RAM issues; see the chunked-read sketch below.
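For example, one way to apply point 2 (a sketch only; the table, columns and chunk size are assumptions, and session is any DB-API connection from the examples above): do the aggregation in Teradata and stream the result in chunks, so the full 20 million rows never sit in pandas at once.

    # Push feature engineering into Teradata and read the (much smaller) result in chunks.
    import pandas as pd

    query = """
    SELECT recipient_age, recipient_gender, send_date,
           COUNT(*)         AS emails_sent,
           SUM(opened_flag) AS emails_opened
    FROM   mydb.email_events
    GROUP  BY 1, 2, 3
    """

    chunks = []
    for chunk in pd.read_sql(query, session, chunksize=500000):
        # Filter / downcast each chunk here before it accumulates in RAM.
        chunks.append(chunk)

    df = pd.concat(chunks, ignore_index=True)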


hhoeck
  • Thanks @hhoeck. I think lack of enough RAM is causing the problem, as the file size is almost half the RAM capacity. I have now optimized the query to extract only the required variables & it imports data pretty fast. – Sreenath1986 Jul 02 '18 at 22:59