
The pandas code I have used for connecting to Teradata is:

import configparser

import pandas as pd
import teradatasql

config = configparser.ConfigParser()
config.read('config.ini')  # hypothetical path to the settings file

database = config.get('Teradata connection', 'database')
host = config.get('Teradata connection', 'host')
user = config.get('Teradata connection', 'user')
pwd = config.get('Teradata connection', 'pwd')

with teradatasql.connect(host=host, user=user, password=pwd) as connect:
    query1 = "SELECT * FROM {}.{}".format(database, tables)  # `tables` holds the table name
    df = pd.read_sql_query(query1, connect)

Now, I need to use the Dask library to load this big data as an alternative to pandas.

Please suggest a method for connecting to Teradata with Dask.

krx

1 Answer

Teradata appears to have a SQLAlchemy dialect, so you should be able to install that, set your connection string appropriately, and use Dask's existing read_sql_table function.
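
A minimal sketch of that route, assuming the teradatasqlalchemy dialect is installed; the connection-string format, table name, and "id" index column below are assumptions, not from the original post:

import dask.dataframe as dd

# Assumed SQLAlchemy URI format for the teradatasqlalchemy dialect
uri = "teradatasql://{}:{}@{}".format(user, pwd, host)

# read_sql_table partitions on index_col, so pick an indexed numeric
# or datetime column; "id" and "my_table" are hypothetical names.
ddf = dd.read_sql_table(
    "my_table",
    uri,
    index_col="id",
    npartitions=10,
    schema=database,  # the Teradata database acts as the schema
)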

Alternatively, you could do this by hand: you need to decide on a set of conditions that will partition the data for you, each partition being small enough for your workers to handle. Then you can build the partitions and combine them into a dataframe as follows (an example of generating such conditions appears after the snippet):

import dask
import dask.dataframe as dd
import pandas as pd
import teradatasql

def get_part(condition):
    # Fetch one partition's worth of rows as an in-memory pandas frame
    with teradatasql.connect(host=host, user=user, password=pwd) as connect:
        query1 = "SELECT * FROM {}.{} WHERE {}".format(database, tables, condition)
        return pd.read_sql_query(query1, connect)

parts = [dask.delayed(get_part)(cond) for cond in conditions]
df = dd.from_delayed(parts)
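
For illustration, the conditions could be equal-width ranges over a numeric column; the "id" column and its bounds here are hypothetical:

# Hypothetical: split an integer "id" column into 10 half-open ranges
lo, hi, nparts = 0, 1_000_000, 10
step = (hi - lo) // nparts
conditions = [
    "id >= {} AND id < {}".format(lo + i * step, lo + (i + 1) * step)
    for i in range(nparts)
]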

(Ideally, you can derive the meta= parameter for from_delayed beforehand, perhaps by fetching the first few rows of the original query.)
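
One way to do that, assuming Teradata's TOP syntax for limiting the sample:

# Grab a small sample just to capture column names and dtypes
with teradatasql.connect(host=host, user=user, password=pwd) as connect:
    sample = pd.read_sql_query(
        "SELECT TOP 10 * FROM {}.{}".format(database, tables), connect
    )

meta = sample.iloc[:0]  # empty frame with the correct schema
df = dd.from_delayed(parts, meta=meta)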

mdurant