
I have a large CSV file that I'd like to read with arrow::read_csv_arrow(). However, the file contains quoted strings that themselves contain backslash-escaped quotes. readr::read_delim() can read the file (given the correct settings), while arrow::read_csv_arrow() cannot:

library(arrow)
library(readr)

# create offending file
x <- tempfile()
write_lines(
'
id,text
1,Some interesting text
2,"Some text on: \"how to break arrow\" by X, and Y"
', x)

read_delim(x, delim = ",", escape_double = FALSE, escape_backslash = TRUE)
#> Rows: 2 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): text
#> dbl (1): id
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 2
#>      id text                                              
#>   <dbl> <chr>                                             
#> 1     1 "Some interesting text"                           
#> 2     2 "Some text on: \"how to break arrow\" by X, and Y"

read_csv_arrow(x, escape_double = FALSE, escape_backslash = TRUE)
#> Error:
#> ! Invalid: CSV parse error: Row #3: Expected 2 columns, got 3: 2,"Some text on: "how to break arrow" by X, and Y"

Created on 2022-10-13 with reprex v2.0.2

I have tried various settings on the parser, to no avail, such as:

read_csv_arrow(x, parse_options = CsvParseOptions$create(double_quote = FALSE, escaping = TRUE))
Thomas K
  • I can reproduce (arrow 9.0.0.1 and 10.0.0.0) – mdag02 Nov 01 '22 at 21:51
  • The quotes aren't escaped in the file when written this way; I'm surprised {readr} works – alistaire Nov 01 '22 at 23:22
  • I think it would be helpful to others if you posted the exact CSV you need to parse. Your snippet that writes to a temporary file produces a malformed CSV. – amoeba Nov 02 '22 at 00:49
  • @amoeba the file was written with sparklyr::spark_write_csv, see my comment on https://issues.apache.org/jira/browse/ARROW-18219. Don't have access to the file atm, but the structure displayed above is exactly as in the file (except for the newlines before and after the three data lines). – Thomas K Nov 03 '22 at 13:46
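
The point raised in the comments (that the quotes are not escaped in the file when it is written this way) can be checked with a short sketch in base R. It assumes nothing about arrow or readr; the variable names `as_written` and `escaped` are made up for illustration, and the second variant is only a guess at what a backslash-escaped file would contain, not a confirmed copy of the original data.

```r
# Base R only: inspect what actually lands in the file.

# In a single-quoted R string, \" is just an escape for a plain ",
# so no backslash reaches the file.
x <- tempfile()
writeLines('2,"Some text on: \"how to break arrow\" by X, and Y"', x)
as_written <- readLines(x)
cat(as_written, sep = "\n")

# Doubling the backslash keeps a literal backslash before each inner quote
# (assumption: this is what a backslash-escaped CSV row would look like).
y <- tempfile()
writeLines('2,"Some text on: \\"how to break arrow\\" by X, and Y"', y)
escaped <- readLines(y)
cat(escaped, sep = "\n")

# Check for a literal backslash in each line.
grepl("\\", as_written, fixed = TRUE)  # FALSE
grepl("\\", escaped, fixed = TRUE)     # TRUE
```

The first line printed matches the row shown in the arrow error message above, which contains no backslashes. Whether either parser accepts the second variant is not something this sketch tests.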
