I am reading a few thousand csv files where some columns have a very 'interesting' format: {""Q0"":""double double quote""}

read.csv seems to read it fine, but read_csv and fread each return something different (see below). I expected to get: {"Q0":"double double quote"}

Is this a bug or am I doing something wrong?

# Content of csv file
# "numbers", "simple_quote", "double_quote"
# "9", "quoted text", "{""Q0"":""double double quote""}"

library(readr)  
library(data.table)
  
read.csv("test.csv")
#>   numbers simple_quote                  double_quote
#> 1       9  quoted text  {"Q0":"double double quote"}

read_csv("test.csv")
#> # A tibble: 1 x 3
#>   numbers simple_quote double_quote                      
#>     <dbl> <chr>        <chr>                             
#> 1       9 quoted text  "{\"Q0\":\"double double quote\"}"

fread("test.csv")
#>    numbers simple_quote                     double_quote
#> 1:       9  quoted text {""Q0"":""double double quote""}

Created on 2021-04-09 by the reprex package (v2.0.0)

Gorka
  • Do you want to keep the double quotes in the resulting string? Or do you want to save the characters that are within the double quotes? – Andrew Brēza Apr 09 '21 at 18:15
  • `read_csv` is actually returning basically the same data as `read.csv` if you strip the whitespace after the comma: `read.csv("quotes.txt")$double_quote[1] == readr::read_csv("quotes.txt")$double_quote[1]`. I'm not sure which value you think is "correct" in this case. There are a lot of weird CSV files out there. There's an open issue at data.table for this difference: https://github.com/Rdatatable/data.table/issues/4779 – MrFlick Apr 09 '21 at 19:20
  • Thanks for your answers! My expectation is to get: `{"Q0":"double double quote"}`. – Gorka Apr 10 '21 at 06:44
  • I suggest that since `read.csv` and `readr::read_csv` are doing it correctly, and `data.table::fread` has a known bug with this, the quick answer to your question is: **Yes, this is a bug**. I can think of no easy workaround to continue using `fread` (in R or shell-scripting), so I don't know if this question can be shifted into *"how to use `fread` here"*. I think you're stuck using one of the other two for now. (Until https://github.com/Rdatatable/data.table/issues/1109 and https://github.com/Rdatatable/data.table/issues/4779 are resolved, that is. Note that `#1109` was filed in 2015.) – r2evans Apr 10 '21 at 17:01
  • Thanks r2evans. The speed benefit of fread makes it 'unavoidable' in this case, so I opted for taking care of the double double quotes with gsub(). – Gorka Apr 10 '21 at 19:28
  • @Gorka Could you please show your workaround using `gsub()` as an answer to this question? – Jens Piegsa Aug 05 '21 at 11:04

1 Answer

Just to consolidate @MrFlick's and @r2evans's comments in a single response, and to include my final workaround as @jens-piegsa suggested.

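It seems to be a bug, at least on the data.table front. See the issues linked in the comments: https://github.com/Rdatatable/data.table/issues/1109 and https://github.com/Rdatatable/data.table/issues/4779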

In my use case, the speed advantage of data.table is huge, so I added a gsub() step to take care of the double quotes.

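library(dplyr)

# Strip the leading {""Q0"":"" and the trailing ""} so that only the inner value of the stimulus column remains
DF %>% mutate(stimulus = gsub('\\{""Q0"":""|""\\}', '', stimulus))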
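
If the goal is instead the string from the question, `{"Q0":"double double quote"}`, a more general variant is to collapse every doubled quote back into a single one. The following is only a minimal sketch: `unescape_quotes` is a hypothetical helper name (it is not part of data.table), and it assumes that every `""` in the column is an escaped quote that fread left untouched, so it should not be applied to columns that fread already parsed correctly.

```r
# Hypothetical helper (not part of data.table): turn every doubled
# double quote ("") left behind by fread() into a single one (").
# fixed = TRUE treats the pattern as a literal string, not a regex.
unescape_quotes <- function(x) gsub('""', '"', x, fixed = TRUE)

# The value fread() returned in the question
raw <- '{""Q0"":""double double quote""}'

fixed <- unescape_quotes(raw)
cat(fixed, "\n")
```

On a data.table this could be applied by reference, for example `DT[, double_quote := unescape_quotes(double_quote)]`, assuming the affected column is named `double_quote` as in the example file.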
Gorka