
I have some very large CSV files (~183 million rows by 8 columns) that I want to load into a database using R. I use duckdb for this and its built-in function duckdb_read_csv, which is supposed to auto-detect the datatype of each column. If I enter the following code:

library(DBI)
library(duckdb)

con = dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb", read_only = FALSE)

duckdb_read_csv(con, "d15072021", "mydata.csv",
                header = TRUE)

It produces this error:

Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Could not convert string '2' to BOOL between line 12492801 and 12493825 in column 9. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0

I've looked at the rows in question and I can't find any irregularities in column 9. Unfortunately, I cannot post the dataset because it's confidential. But the entire column is filled with either FALSE or TRUE.

If I set the parameter nrow.check to something larger than 12493825, it doesn't produce the same error, but it takes very long and simply converts the column to VARCHAR instead of a logical. Setting nrow.check to -1 (meaning it checks every row for a pattern) crashes R and my PC completely.
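For illustration, such a call might look like this (the value 13000000 is just an example chosen to exceed the failing line number, not a recommended setting):

# Same call as before, but let the sniffer sample ~13 million rows
# (enough to cover line 12493825 from the error message).
duckdb_read_csv(con, "d15072021", "mydata.csv",
                header = TRUE,
                nrow.check = 13000000)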

The weird thing is that this isn't consistent. Earlier I imported the dataset while keeping the default value for nrow.check at 500, and it read the file with no issue (though it still converted column 9 to VARCHAR). I have to read a lot of files that follow the same pattern, so I need a reliable way of reading them. Does anyone know how duckdb_read_csv actually works and why I might get this error?

Note that reading the files into memory and then into a database isn't an option because I run out of memory instantly.

  • You could use the CSV Lint plug-in in Notepad++ to validate the data, see if there are any obvious data errors. https://github.com/BdR76/CSVLint The plug-in also has a menu item Analyse Data Report to quickly check each column for unexpected codes, datatypes, out of range, weird data etc. – BdR Nov 15 '22 at 16:05
  • Greetings! Usually it is helpful to provide a minimally reproducible dataset for questions here so people can troubleshoot your problems. One way of doing this is by using the `dput` function. In your case since you have a large dataset, you could also provide either the `head` of your data in the `dput` or use some other example data. You can find out how to use `dput` here btw: https://youtu.be/3EID3P1oisg – Shawn Hemelstrand Nov 16 '22 at 06:35
  • You can also use `colClasses` as argument to `duckdb_read_csv`. – Karsten W. Jul 19 '23 at 15:55

1 Answer


The way the sniffer works is by sampling nrow.check rows to figure out each column's data type, so the result can differ between runs if you get unlucky. Increasing it will reduce the chance of failure, mainly because the sniffer looks at more rows. If increasing the number of rows is not possible due to performance issues, you can of course define the schema of the CSV file up front, but then you must know the schema beforehand.

As an example of how you can define the schema and turn off the sniffer:

SELECT * FROM read_csv('test.csv',  COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'), auto_detect='false')
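From R, a similar statement could be issued through DBI. Below is a minimal sketch, assuming a connection like the one in the question; the column names and types (a, b) are the placeholders from the SQL above, not the real schema of mydata.csv:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb", read_only = FALSE)

# Create the target table from the CSV with an explicit schema and the
# sniffer disabled; replace the placeholder columns with the real ones.
dbExecute(con, "
  CREATE TABLE d15072021 AS
  SELECT * FROM read_csv('mydata.csv',
                         COLUMNS = STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'),
                         HEADER = TRUE,
                         auto_detect = 'false')
")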
Pedro Holanda