I have some very large CSV files (~183 million rows by 8 columns) that I want to load into a database using R. I'm using duckdb and its built-in function duckdb_read_csv, which is supposed to auto-detect the datatype of each column. If I run the following code:
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb", read_only = FALSE)
duckdb_read_csv(con, "d15072021", "mydata.csv",
                header = TRUE)
It produces this error:
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Could not convert string '2' to BOOL between line 12492801 and 12493825 in column 9. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
I've looked at the rows in question and can't find any irregularities in column 9. Unfortunately I cannot post the dataset because it's confidential, but the entire column consists of either TRUE or FALSE.
If I set the parameter nrow.check to something larger than 12493825, it no longer produces the error, but the import takes very long and the column is simply converted to VARCHAR instead of LOGICAL. Setting nrow.check to -1 (meaning every row is checked to determine the type) crashes R and my entire PC.
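For reference, the larger-nrow.check attempt looked roughly like this (the exact value is just illustrative, picked to be past the line reported in the error):

duckdb_read_csv(con, "d15072021", "mydata.csv",
                header = TRUE,
                nrow.check = 12500000)  # past line 12493825; very slow, and column 9 still ends up VARCHAR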
The weird thing is that this isn't consistent: earlier I imported the dataset while keeping nrow.check at its default of 500, and it read the file without issue (though it still converted column 9 to VARCHAR). I have to read a lot of files that follow the same pattern, so I need a reliable way of reading them. Does anyone know how duckdb_read_csv actually works and why I might get this error?
Note that reading the files into memory first and then writing them to the database isn't an option, because I run out of memory instantly.
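For clarity, the in-memory route I'm ruling out would be something along these lines, which exhausts RAM immediately at ~183 million rows:

# Naive approach: load the whole CSV into a data.frame, then write it to DuckDB.
# Not feasible here -- the data.frame alone does not fit in memory.
df <- read.csv("mydata.csv", header = TRUE)
dbWriteTable(con, "d15072021", df)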