1

Given a ; delimited file of structure:

colA; colB; colC
1;A; 10
2;B; 11     
3;C"; 12
4;D""; 15
5;"F";20
6;K"""; 21
7;""M";22
8; \""O;23

I would like to ensure that colB is always imported verbatim as a character string. In particular, I would like to preserve all values including ""M" and \""O.

Attempt

I'm currently trying:

require(readr)
tst_dta <- read_delim(
  file = "test_file.csv",
  escape_double = FALSE,
  delim = ";",
  col_types = cols(
    colA = col_integer(),
    colB = col_character(),
    colC = col_integer()
  )
)

but this returns:

> tst_dta
# A tibble: 8 x 3
   colA colB        colC
  <int> <chr>      <int>
1     1 A             10
2     2 B             NA
3     3 "C\""         12
4     4 "D\"\""       15
5     5 F             20
6     6 "K\"\"\""     21
7     7 "\"\"M\""     22
8     8 " \\\"\"O"    23

Desired rsults

The desired results should reflect:

    colA colB  colC
   <int> <chr> <int>
    1     A     10
    2     B     11     
    3     C"    12
    4     D""   15
    5    "F"    20
    6     K"""  21
    7   ""M"    22
    8  \""O     23

Other points:

  • Ideally, I would also like to ensure that non-ASCII characters are ignored in a manner that value \""[Non-ASCII-Character]Owould appear in the resulting data frame as \""O string.

Updates

As per comments, more examples:

is:

colA; colB; colC
1; text \" text; 2

should be:

colA;colB;colC
1;text text;2

is:

colA; colB; colC
1; text \;" text; 2

should be:

colA;colB;colC
1;text text;2

is:

colA; colB; colC
1; [non-ASCII] text something \;" text; 2

should be:

colA;colB;colC
1;text something;2
Konrad
  • 17,740
  • 16
  • 106
  • 167
  • 1
    not possible, or reasonable in my opinion. – Andre Elrico Jun 21 '18 at 14:13
  • @AndreElrico Broadly speaking, I would tend to agree on the *reasonable* point, but expectedly, I'm doing this out of necessity not out of choice. Happy to consider answer that will get me as close as possible to the desired results, i.e maintain column structure, drop non-ASCII characters and salvage rest of the data to greatest possible extent. – Konrad Jun 21 '18 at 14:15
  • I'm not sure how you can have `"` inside a character string without escaping. Why do you need it without escapement? – Andre Elrico Jun 21 '18 at 14:19
  • 1
    count me as interessted in a solution! – Andre Elrico Jun 21 '18 at 14:20
  • @AndreElrico You mean in the input file? The file appeared to "suffer" some manual transformations with odd `"` and non-ASCII characters inserted on random. In a word, the key goal is to preserve as much file as possible. – Konrad Jun 21 '18 at 14:23
  • can you post real data examples where something "wrong" happened and what it really should be instead? – Andre Elrico Jun 21 '18 at 14:26
  • @AndreElrico Updated with some more examples – Konrad Jun 21 '18 at 14:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/173567/discussion-between-andre-elrico-and-konrad). – Andre Elrico Jun 21 '18 at 14:31

1 Answers1

1

If you need to use readr-functions, then look at it's argument list and see if it has an equivalent to the quote argument in read.table (which allows simple access:

 read.table(text=txt, header=TRUE, quote="", sep=";")
  colA colB colC
1    1    A   10
2    2    B   11
3    3   C"   12
4    4  D""   15
5    5  "F"   20
6    6 K"""   21
7    7 ""M"   22
8    8  ""O   23

Seems like it should succeed, since it's the third argument in readr::read_delim. The default in both cases is "\"" which is a single double-quote. Set it to an empty character (""):

Usage

read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
  escape_double = TRUE, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress())

And this is the print representation of the result. I would note that this print representation seems bit irregular. Character values are enclosed in double quotes only if they have embedded double quotes, i.e \". On the other hand such columns are character which is a nice change from the default settings in read.table which give you factor columns:

read_delim(file=txt,  quote="", delim=";")
# A tibble: 8 x 3
   colA ` colB`   ` colC`   
  <int> <chr>     <chr>     
1     1 A         " 10"     
2     2 B         " 11     "
3     3 "C\""     " 12"     
4     4 "D\"\""   " 15"     
5     5 "\"F\""   20        
6     6 "K\"\"\"" " 21"     
7     7 "\"\"M\"" 22        
8     8 " \"\"O"  23 

You are hereby warned that using this option with read_delim does mean that neither column names nor values are trimmed to remove whitespace. And everything is character, even the columns that would otherwise come in as character. Notice the name of your second column. That does not occur with read.table:

read_delim(file=txt,  quote="", delim=";")$` colB` == 
         read.table(text=txt, header=TRUE, quote="", sep=";")$colB
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Further gsub-processing would be needed if you wanted leading or trailing whitespace to be removed. rm_non_ascii in pkg {qdapRegex} can remove non-ASCII characters

Community
  • 1
  • 1
IRTFM
  • 258,963
  • 21
  • 364
  • 487