
I am bulk inserting data into crateDB using

df.to_sql(tableId, 'crate://xxxxxxx:4200', if_exists='append', index=False, chunksize=20000)

The data contains a few faulty rows where one column of type 'numeric' contains a value of type 'object' / 'string'.

This results in a typecasting error while inserting.

When I set the chunksize parameter of to_sql() to 1, an exception is raised and I can catch the issue.

When chunksize is greater than 1, to_sql() just continues and I cannot verify that the data was inserted properly.

What am I missing, and what would be a good approach to fix this problem?
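One workaround is to validate the numeric column before calling to_sql(), so faulty rows can be separated out instead of poisoning a whole chunk. A minimal sketch (the column name "price" and the sample values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, "oops", 2.0, "n/a"]})

# Coerce to numeric; values that cannot be cast become NaN.
coerced = pd.to_numeric(df["price"], errors="coerce")
bad_mask = coerced.isna() & df["price"].notna()

faulty = df[bad_mask]                                  # rows to log / fix by hand
clean = df[~bad_mask].assign(price=coerced[~bad_mask]) # safe to bulk insert

# clean.to_sql(tableId, engine, if_exists='append',
#              index=False, chunksize=20000)
```

This keeps the fast bulk path for the clean rows and confines the error handling to the small `faulty` frame.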

some_programmer
Jabb

2 Answers


For bulk inserts you need to inspect the result of the query for error handling. It includes an array of rowcounts. A -2 indicates that an error occurred.

For certain errors, likely including typecasting errors, all rows will be rejected (i.e. all rowcounts are -2), in which case you will need to fall back to single inserts.

See https://crate.io/docs/crate/reference/en/4.6/interfaces/http.html#bulk-errors for additional details.
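A sketch of that approach, assuming you bypass to_sql() and talk to CrateDB's HTTP endpoint directly (host, table, and column names here are placeholders):

```python
import json
import urllib.request

def failed_row_indices(results):
    """Indices of bulk operations whose rowcount is -2 (error)."""
    return [i for i, r in enumerate(results) if r.get("rowcount") == -2]

def bulk_insert(host, stmt, bulk_args):
    """POST a bulk insert to CrateDB's /_sql endpoint and return the
    per-operation results array from the response."""
    payload = json.dumps({"stmt": stmt, "bulk_args": bulk_args}).encode()
    req = urllib.request.Request(
        f"http://{host}:4200/_sql",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("results", [])

# usage (not executed here):
# results = bulk_insert("xxxxxxx", "INSERT INTO t (x) VALUES (?)", [[1], [2]])
# bad = failed_row_indices(results)  # fall back to single inserts for these
```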

jayeff
  • Thanks. This applies to the HTTP REST API. When using to_sql(), None is always returned, so there is nothing to inspect, unfortunately. – Jabb Sep 01 '21 at 11:35

The problem most likely is that pandas doesn't return anything from to_sql(); see also https://github.com/pandas-dev/pandas/issues/23998 and https://github.com/pandas-dev/pandas/issues/42645 . Since the bulk insert in CrateDB doesn't fail completely but only skips the faulty rows, no exception is raised; instead an array with one value per row is returned. That return value, however, is never processed by pandas.

One could change the SQLAlchemy implementation of CrateDB, but this would in turn break any implementation that actually uses the results array.
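One hook that doesn't require changing the dialect is to_sql()'s `method` parameter, which pandas documents as a callable with signature (pd_table, conn, keys, data_iter). A custom method can run the inserts itself and check each result instead of discarding it. A hedged sketch (`insert_and_check` and `build_insert` are my own names, not a pandas or CrateDB API):

```python
def build_insert(table_name, keys):
    """Build a parameterised single-row INSERT for the given columns."""
    cols = ", ".join(keys)
    params = ", ".join(f":{k}" for k in keys)
    return f"INSERT INTO {table_name} ({cols}) VALUES ({params})"

def insert_and_check(pd_table, conn, keys, data_iter):
    """Custom `method` for DataFrame.to_sql: insert row by row and
    fail loudly if a row is rejected."""
    from sqlalchemy import text  # deferred so the sketch stays importable
    sql = text(build_insert(pd_table.name, keys))
    for row in data_iter:
        result = conn.execute(sql, dict(zip(keys, row)))
        if result.rowcount != 1:
            raise ValueError(f"row rejected: {row!r}")

# usage: df.to_sql(tableId, engine, if_exists='append', index=False,
#                  method=insert_and_check)
```

Note this trades the speed of the bulk path for per-row error reporting, so it makes most sense as a fallback for chunks that failed a bulk attempt.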

proddata