I am parsing long strings with semicolons and quotes using R v4.0.0 and stringi. Here is an example string:

tstr1 <- 'gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; partial "true"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'

I would like to extract a quoted substring by first matching a variable pattern var and then extracting everything until the next semicolon. I would like to avoid matching instances of var that are within quoted substrings. So far, I have this:

library(stringr)  # str_extract() is exported by stringr (which is built on stringi)
library(dplyr)
var <- "partial"
str_extract(string = tstr1, pattern = paste0('"; ', var, '[^;]+')) %>%
    gsub(paste0("\"; ", var), "", .) %>%
    gsub("\"", "", .) %>% trimws()

This returns "true", which is my desired output. However, I need a regex that also works in two edge cases:

Case 1

When var is at the beginning of the string, so there is no preceding "; to match against.

tstr2 <- 'partial "true"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'

Expected output: "true"

Case 2

When the quoted substring to be extracted contains a semicolon, I would want to match everything until the next semicolon that is not within the quoted substring.

tstr3 <- 'partial "true; foo"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'

Expected output: "true; foo"

acvill

1 Answer

We can use an alternation (|) to also cover the case where `var` sits at the start of the string, with no preceding "; before it, and then extract the characters between the first pair of double quotes.

library(stringr)
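# First alternative: var after a closing '";'. Second: var at the start of the string.
str_extract(tstr, sprintf('";\\s+%1$s[^;]+|^%1$s[^;]+;[^"]+"', var)) %>%
    # Strip the leading '"; ' left over from the first alternative.
    trimws(whitespace = '["; ]+', which = 'left') %>%
    # Keep only the text between the first pair of double quotes.
    str_extract('(?<=")[^"]+(?=")')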

-output

[1] "true"      "true"      "true; foo"
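
To make the role of `sprintf` concrete, here is a minimal sketch of how the pattern is assembled. It assumes `var` holds plain text with no regex metacharacters; a value such as "a.b" would need escaping first.

```r
var <- "partial"  # same value as in the question

# %1$s refers to the first argument after the format string, so `var`
# is substituted into both alternatives of the regex.
pattern <- sprintf('";\\s+%1$s[^;]+|^%1$s[^;]+;[^"]+"', var)

writeLines(pattern)
# prints: ";\s+partial[^;]+|^partial[^;]+;[^"]+"
```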

data

tstr <- c(tstr1, tstr2, tstr3)
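
If a semicolon-containing value could also occur in the middle of the string, one possible alternative is to parse every key "value" pair first and then look the value up by key. This is only a sketch, not part of the approach above: `extract_attr` is a hypothetical helper name, the sample strings are shortened stand-ins for the ones in the question, and it assumes values never contain a double quote.

```r
library(stringr)

# Hypothetical helper: returns the quoted value for `key` in each string,
# or NA when the key is absent.
extract_attr <- function(x, key) {
  # Each match consumes a whole quoted value, so text inside quotes
  # (such as "partial" in the note field) is never read as a key.
  pairs <- str_match_all(x, '(\\w+)\\s+"([^"]*)"')
  vapply(pairs, function(m) {
    hit <- m[m[, 2] == key, 3]  # column 2 = key, column 3 = value
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1))
}

# Shortened stand-ins for tstr1, tstr2 and tstr3
x <- c('gene_id "A"; note "incomplete; partial in the middle"; partial "true"',
       'partial "true"; gene_id "A"',
       'partial "true; foo"; gene_id "A"')

extract_attr(x, "partial")
# [1] "true"      "true"      "true; foo"
```

Because the key is compared with `==` rather than spliced into a regex, `var` does not need escaping here.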
akrun
  • +1, This works great for the provided example, but could you edit your answer to make it generalizable to `var` and not the specific instance `var <- "partial"`? – acvill Nov 09 '21 at 16:07
  • @acvill added with `sprintf` to insert the 'var' – akrun Nov 09 '21 at 16:10