I like this feature: CREATE VIEW ontime AS SELECT * FROM read_csv_auto('test.csv'); but I noticed that it does not properly identify column types when loading big CSV files and fails on rows that do not fit into the detected column type. What is the default number of rows used by the CSV reader to decide on column types? Is it possible to adjust the number of these rows?
Could you maybe be a bit more specific? As in, which type you expect and which one it selected? Also, which exact data file did you use? From your question, it looks like you used the ontime BTS dataset? – Hannes Mühleisen Aug 04 '20 at 18:28
I have a CSV file where a column has BIGINT values coming after INT values, starting at row 1900. The source defines this column as BIGINT. Once created, the view works until it hits these BIGINT values in the CSV and fails with the following error: Error: Conversion: Could not convert string '300000011250328' to numeric. DESCRIBE on the view in DuckDB shows this column as INT, while in fact it is expected to be BIGINT. – Alex O Aug 06 '20 at 01:35
1 Answer
What is the default number of rows used by the CSV reader to decide on column types?
The current behavior is that 10 chunks of 100 rows each are sampled. This can be broken down into two scenarios:
- The file has roughly 1000 rows or fewer (or is compressed): chunks are sampled without gaps in between, i.e. at most the first 1000 rows are taken into consideration.
- The file has considerably more than 1000 rows: the first 100 rows are sampled, and 9 more equidistant chunks of 100 rows are sampled from across the file.
In both cases, the first 100 rows produce an initial type guess, which is then checked for consistency against the subsequent chunks. If the types do not match, the guess is revised accordingly.
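To see what the sampler actually decided for a given file, you can inspect the inferred schema directly. This is only a diagnostic check, reusing the view definition from the question ('test.csv' and the view name ontime are taken from the example above):

```sql
-- Create the view with automatic type detection.
CREATE VIEW ontime AS SELECT * FROM read_csv_auto('test.csv');

-- DESCRIBE shows the column types the sampler settled on
-- (e.g. INTEGER where BIGINT would have been needed).
DESCRIBE ontime;
```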
There is nevertheless a theoretical (and practical) chance that the type guess turns out wrong. Say a file with 1,000,000 rows has a column which happens to contain 1000 string values and 999,000 integer values. Assuming the string values are randomly distributed, there is a ~37% chance that the sample chunks will not contain ANY string value ((1 - 1000/1000000)^1000 ≈ 0.368). In that case the type guesser would suggest an integer type for that column, and the parser would choke on the first string value. It would be sensible to revise the guess at this point (as, e.g., the h2oai/datatable implementation does in datatable/reader_fread.cc), but as of now (v0.2) this behavior is not implemented in DuckDB.
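If you want to double-check the arithmetic, the same number can be computed in DuckDB itself; this is just the probability calculation from the example above, nothing specific to the CSV reader:

```sql
-- Chance that 1000 sampled rows miss all 1000 string rows out of 1,000,000 total:
-- (1 - 1000/1000000)^1000 ≈ 0.368
SELECT power(1 - 1000.0 / 1000000, 1000) AS p_no_string_sampled;
```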
The recommended workaround is to use the DuckDB read_csv function and define the schema/data types manually. See the COPY Statement section here: DuckDB Docs - CSV Loading.
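As a rough sketch of that workaround (the column names flight_id and carrier and the table name ontime_tbl are placeholders for your actual CSV header, and the exact parameter names may differ between DuckDB versions):

```sql
-- Option 1: read_csv with an explicit column list, so no type detection takes place.
CREATE VIEW ontime AS
SELECT * FROM read_csv('test.csv',
                       delim=',',
                       header=true,
                       columns={'flight_id': 'BIGINT', 'carrier': 'VARCHAR'});

-- Option 2: create the table with the desired schema and load it via COPY.
CREATE TABLE ontime_tbl (flight_id BIGINT, carrier VARCHAR);
COPY ontime_tbl FROM 'test.csv' (DELIMITER ',', HEADER);
```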
Is it possible to adjust the number of these rows?
The number of rows per chunk is defined in the source; see duckdb/buffered_csv_reader.hpp, line 15. If you want to change the number, you would have to build DuckDB from source (which is actually quite straightforward). The chunk size cannot be bigger than the standard vector size, though, which is 1024 by default. Nevertheless, I think it would be feasible to expose the chunk size and the number of sample chunks as options to the user, and we could consider doing something like this in a future release. Another option would be to automatically set the number of sample chunks relative to the total number of rows in the file. Thanks for the question.

There is now an open issue on that topic: https://github.com/cwida/duckdb/issues/811 – tdoehmen Aug 06 '20 at 09:48