
I have a vector of strings that I would like to parse. However, the brackets in combination with the quotes make this quite complicated. I would prefer a solution using stringr, but that is not a requirement.

x = c("[\"DER001_A375_96H:TRCN0000052583:-666\"]", "[\"TRCN0000052583\"]", "[\"AAK1\",\"AARS\"]", "[\"A375\"]", "-6.7389873 ... 4.6063291") 

> x
[1] "[\"DER001_A375_96H:TRCN0000052583:-666\"]" "[\"TRCN0000052583\"]"                     
[3] "[\"AAK1\",\"AARS\"]"                       "[\"A375\"]"                               
[5] "-6.7389873 ... 4.6063291"    

Expected result:

DER001_A375_96H:TRCN0000052583:-666
TRCN0000052583
AAK1
AARS
A375
6.7389873
4.6063291
MrNetherlands
  • Why does the data look like this? Did it start off as JSON data at one point? Are you sure you can't generate cleaner data further upstream in your pipeline? This seems like a real mess. – MrFlick Jun 07 '19 at 14:59
  • It is an output from Shiny which I cannot change. See here: https://stackoverflow.com/questions/52858889/extract-filters-from-r-shiny-datatable/ – MrNetherlands Jun 07 '19 at 15:13

2 Answers


Replace each occurrence of " ... " with a comma and remove all square brackets. (Note that [...] defines a character class; if the first character in the class is ], it is treated as a literal member of the class rather than as the terminating ].) Finally, read the result in using scan, which splits on the commas and, by default, strips the double quotes. No packages are used.

scan(text = gsub('[][]', '', gsub(" ... ", ",", x, fixed = TRUE)), 
  sep = ",", what = "", quiet = TRUE)

giving:

[1] "DER001_A375_96H:TRCN0000052583:-666" "TRCN0000052583"                     
[3] "AAK1"                                "AARS"                               
[5] "A375"                                "-6.7389873"                         
[7] "4.6063291"                     
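To see the two steps in isolation, here is a minimal base R sketch on a single element. The variable names `demo` and `no_brackets` are only illustrative and are not part of the answer above.

```r
demo <- "[\"AAK1\",\"AARS\"]"

# A ']' placed first inside the class is literal, so '[][]' matches ']' or '['.
no_brackets <- gsub("[][]", "", demo)

# scan() splits on the commas; with sep = "," its default 'quote' argument
# treats the double quotes as delimiters, so they are dropped from the fields.
fields <- scan(text = no_brackets, sep = ",", what = "", quiet = TRUE)
fields
```

At this point `no_brackets` still contains the double quotes; they only disappear when scan reads the text.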
G. Grothendieck

With the help of SO (for parsing the string) and the stringr cheat sheet at http://edrub.in/CheatSheets/cheatSheetStringr.pdf :

x = c("[\"DER001_A375_96H:TRCN0000052583:-666\"]", 
      "[\"TRCN0000052583\"]", "[\"AAK1\",\"AARS\"]", 
      "[\"A375\"]", "-6.7389873 ... 4.6063291") 
library("dplyr", quietly = TRUE, warn.conflicts = FALSE)
x1 <- x %>% 
        stringr::str_remove_all(pattern = "\"" ) %>% 
        stringr::str_remove_all(pattern = "\\[" ) %>% 
        stringr::str_remove_all(pattern = "\\]" )

x2 <- unlist ( strsplit(x1, split = ",") )
# trimws() removes the spaces that surrounded the "..." separator
x3 <- trimws ( unlist ( strsplit(x2, split = "\\.\\.\\.") ) )
x3
#> [1] "DER001_A375_96H:TRCN0000052583:-666"
#> [2] "TRCN0000052583"                     
#> [3] "AAK1"                               
#> [4] "AARS"                               
#> [5] "A375"                               
#> [6] "-6.7389873"                         
#> [7] "4.6063291"

Created on 2019-06-07 by the reprex package (v0.2.1)
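The same idea can be condensed into two stringr calls. This is only a sketch, assuming the input always has the shape shown in the question; the names `cleaned` and `pieces` are illustrative.

```r
library(stringr)

x <- c("[\"DER001_A375_96H:TRCN0000052583:-666\"]",
       "[\"TRCN0000052583\"]", "[\"AAK1\",\"AARS\"]",
       "[\"A375\"]", "-6.7389873 ... 4.6063291")

# One character class removes '[', ']' and '"' in a single pass.
cleaned <- str_remove_all(x, "[\\[\\]\"]")

# Split on a comma or a literal "...", absorbing any surrounding whitespace
# so the numeric pieces come out already trimmed.
pieces <- unlist(str_split(cleaned, "\\s*(?:,|\\.{3})\\s*"))
pieces
```

Splitting on both separators at once avoids the intermediate vectors, at the cost of a slightly denser pattern.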

cbo