-1

I'm trying to read a 20M records tsv in R, but keep getting errors because aparently the file was not generated correctly.

However, the data is always uploaded to Adobe's servers, for a product we use. It gives warning about values not been correctly formated, but nonetheless it gets uploaded and we can use the data.

1.- How can I upload the data to duckdb regardless of the formating issues?

2.- Error in file "xxx,csv" on line 1107386. How can I access only that line (or maybe 5 before and after) to see what is really the problem?

enter image description here

Omar Gonzales
  • 3,806
  • 10
  • 56
  • 120

1 Answers1

1

EDIT: long story short, duckdb_read_csv needs the quote="" just like read.csv does in the previous answer (preserved, below). You may still need to consider some of the reparative steps for removing other quotes that may now be retained by this.

Reprex:

duck <- DBI::dbConnect(duckdb::duckdb())
writeLines(c('a,b,d', '1,"Boticas MiCasa" RSL,3'), "fake.csv")
duckdb::duckdb_read_csv(duck, "faketable", "fake.csv")
# Error: rapi_execute: Failed to run query
# Error: Invalid Input Error: Error in file "/home/r2/fake.csv" on line 2: quote should be followed by end of value, end of row or another quote. (  file=/home/r2/fake.csv
#   delimiter=','
#   quote='"'
#   escape='"' (default)
#   header=1
#   sample_size=20480
#   ignore_errors=0
#   all_varchar=0). 
duckdb::duckdb_read_csv(duck, "faketable", "fake.csv", quote="")
DBI::dbGetQuery(duck, "select * from faketable")
#   a                    b d
# 1 1 "Boticas MiCasa" RSL 3

Just a thought to help troubleshoot, not sure it'll work.

When the error is about quotes, you can turn off quote process and try again. It will result in having quotes all throughout your data, but (1) these are easy to remove, and (2) at least you can see where the problem is.

Unfortunately ... if the data has embedded commas within strings (which would make the use of quotes required), then this won't work.

Demo:

# normal
read.csv(text='a,b,d\n"1",2,3\n"4",5,6')
#   a b d
# 1 1 2 3
# 2 4 5 6
read.csv(text='a,b,d\n"1",2,3\n"4,5,6')
# Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
#   incomplete final line found by readTableHeader on 'text'

We can disable processing of quotes by adding quote="", and see the problematic row:

read.csv(text='a,b,d\n"1",2,3\n"4,5,6', quote="")
#     a b d
# 1 "1" 2 3
# 2  "4 5 6

We can clean up both the good ("1") and bad ("4) strings with this pipeline:

read.csv(text='a,b,d\n"1",2,3\n"4,5,6', quote="") |>
  lapply(function(z) if (is.character(z)) type.convert(gsub(r"(^['"]|['"]$)", "", z), as.is=TRUE) else z) |>
  as.data.frame()
#   a b d
# 1 1 2 3
# 2 4 5 6

If you have embedded quotes (requiring CSV's double-quoting escape, as in "hello ""world"" again", then you'd need another step in the cleanup to work on those middle quotes.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • I've found that in the problematic line, there is a value with "". eg. `"Boticas MiCasa" RSL. ` What I don't understand is why "" is causing problems? since the separator is a tab `\t` ? – Omar Gonzales Aug 30 '23 at 19:21
  • @OmarGonzales see my edit. Still `quote=""`. – r2evans Aug 30 '23 at 19:57
  • So, you're saying the embedded quotes are the problem? removing them from source, would be the best option, right? If so, why quotes generate errors, since the separator is tab? – Omar Gonzales Aug 30 '23 at 20:38
  • 1
    The presence or absence or misplacement or error-state with quotes has nothing to do with `sep="\t"` or `sep=","`, that is a red herring. The error is because `duckdb_read_csv` defaults to requiring them to be present for the whole "cell" or none of it, none of that partial stuff. So the fix is (_still_) to tell `duckdb_read_csv` to not look at quotes. I'm not telling you to try to remove them from the source file, though that is an option. I'm telling you to set `quotes=""` and you should be able to read in your data. Then you will need to check to see if you have quotes you don't want. – r2evans Aug 30 '23 at 20:44