I am using kedro.extras.datasets.pandas.SQLTableDataSet and would like to use the chunk_size argument from pandas. However, when running the pipeline, the table gets treated as a generator instead of a pd.DataFrame.

How would you use the chunk_size within the pipeline?

My catalog:

table_name:
  type: pandas.SQLTableDataSet
  credentials: redshift
  table_name: rs_table_name
  layer: output
  save_args:
    if_exists: append
    schema: schema.name
    chunk_size: 1000

1 Answer

Looking at the latest pandas docs, the actual kwarg to use is chunksize, not chunk_size; see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html. Since kedro simply passes your save_args through to pd.DataFrame.to_sql, the keys need to match what to_sql expects:

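# SQLTableDataSet._save in kedro: save_args are forwarded unchanged to DataFrame.to_sql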
def _save(self, data: pd.DataFrame) -> None:
    try:
        data.to_sql(**self._save_args)
    except ImportError as import_error:
        raise _get_missing_module_error(import_error) from import_error
    except NoSuchModuleError as exc:
        raise _get_sql_alchemy_missing_error() from exc
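
With that in mind, the only change needed in the catalog above is renaming the key under save_args (a sketch keeping the original table, credentials and schema):

table_name:
  type: pandas.SQLTableDataSet
  credentials: redshift
  table_name: rs_table_name
  layer: output
  save_args:
    if_exists: append
    schema: schema.name
    chunksize: 1000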

EDIT: Once you have this working in your pipeline, the docs show that pandas.read_sql with chunksize set returns an Iterator[DataFrame]. This means that in your node function you should iterate over the input (and annotate it accordingly, if appropriate), for example:

from typing import Iterator
import pandas as pd

def my_node_func(input_dfs: Iterator[pd.DataFrame], *args):
    for df in input_dfs:  # each df is one chunk of the table
        ...

This works for the latest version of pandas. I have noticed, however, that pandas is aligning its API so that read_csv with chunksize set returns a context manager from pandas>=1.2, so I would expect the same change to reach read_sql as well.
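
For reference, this is what that context-manager style already looks like with read_csv today; a minimal sketch using an in-memory CSV made up purely for illustration:

import io

import pandas as pd

# Fake CSV contents, just to have something to chunk over.
csv_data = io.StringIO("id,value\n1,a\n2,b\n3,c\n")

# With chunksize set, read_csv returns a reader that (from pandas>=1.2) can be
# used as a context manager; each iteration yields a DataFrame chunk.
with pd.read_csv(csv_data, chunksize=2) as reader:
    for chunk in reader:
        print(chunk.shape)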

Napoleon Borntoparty