2

I have a data frame in which some cells contain error messages as strings. The strings come in the following forms:

ERROR-100_Data not found for ID "xxx"
ERROR-100_Data not found for id "xxx"
ERROR-101_Data not found for SUBID "yyy"
Data not found for ID "xxx"
Data not found for id "xxx"

I need to extract the error number (if there is one) and the general description, leaving out the specific ID or SUBID. I have a function where I use the following regular expression:

sub(".*?ERROR-(.*?)for ID.*","\\1",df[,col1],sep="-")

This works only for the first case. Is there a way to obtain the following results using only one expression?

100_Data not found
100_Data not found
101_Data not found
Data not found
Data not found
Aaron Parrilla

3 Answers

2

We can use:

tsxt <- 'ERROR-100_Data not found for ID "xxx"'
gsub("\\sfor.*|ERROR-","",tsxt, perl=TRUE)
[1] "100_Data not found"

Or, as suggested by @Jan, anchor ERROR to the start of the string to make it more robust:

gsub("\\sfor.*|^ERROR-","",tsxt, perl=TRUE)
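
To see what the anchor changes, here is a minimal sketch using a made-up message (not one of the forms in the question) in which "ERROR-" also appears in the middle of the string. The variable names are only illustrative.

```r
# Hypothetical message where "ERROR-" also occurs mid-string
msg <- 'ERROR-100_Retry after ERROR-5 not found for ID "xxx"'

# Unanchored: every occurrence of "ERROR-" is removed
unanchored <- gsub("\\sfor.*|ERROR-", "", msg, perl = TRUE)

# Anchored: only an "ERROR-" at the very start of the string is removed
anchored <- gsub("\\sfor.*|^ERROR-", "", msg, perl = TRUE)

print(unanchored)
print(anchored)
```

For the five forms listed in the question both patterns should give the same result; the anchor only matters when "ERROR-" can show up elsewhere in the text.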
NelsonGon
  • While true, I'd anchor it to the start. Imagine a string like `xxx ERROR Data not found xxx` – Jan Jul 24 '19 at 06:58
1

You could use

^ERROR-|\sfor.+

which needs to be replaced by an empty string, see a demo on regex101.com.
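
In R the backslash has to be doubled inside a string literal, so the pattern above becomes "^ERROR-|\\sfor.+". A minimal sketch, assuming a data frame df with a character column col1 as in the question (the sample rows and the cleaned column name are made up for illustration):

```r
# Sample data mirroring the five forms from the question
df <- data.frame(
  col1 = c(
    'ERROR-100_Data not found for ID "xxx"',
    'ERROR-100_Data not found for id "xxx"',
    'ERROR-101_Data not found for SUBID "yyy"',
    'Data not found for ID "xxx"',
    'Data not found for id "xxx"'
  ),
  stringsAsFactors = FALSE
)

# "\s" is written as "\\s" inside an R string literal
pattern <- "^ERROR-|\\sfor.+"

# gsub rather than sub, because both alternatives can match in the same string
df$cleaned <- gsub(pattern, "", df$col1)
print(df$cleaned)
```

With sub() only the first match would be replaced, so the "ERROR-" prefix would go but the " for ..." tail would stay.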

Jan
1

Use this regex:

.*?(?:ERROR-)?(.*?)\s+for\s+(?:[A-Z]*)?ID

This makes the ERROR- prefix optional, then captures everything up to the point where for ...ID is encountered. The match has to be case-insensitive, so the case-insensitive flag must be enabled. The only capturing group contains the desired text, which can then be used directly without needing any substitution.

The first and third groups in this regex are non-capturing groups: they match their content but do not capture it for later use, which leaves only one capturing group (the middle one). This is done because the OP isn't interested in the text they match. Had they been capturing groups, the match would have produced three groups, and the post-processing would have had to hard-code the use of the second group (the middle one) and discard the other two.
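
One possible way to read the capture group in R is base R's regexec() together with regmatches(). This is only a sketch; the msgs, m and captured names are illustrative, and perl = TRUE is assumed for the non-capturing groups and lazy quantifiers.

```r
msgs <- c(
  'ERROR-100_Data not found for ID "xxx"',
  'ERROR-101_Data not found for SUBID "yyy"',
  'Data not found for id "xxx"'
)

# Same regex as above, with backslashes doubled for the R string literal
pattern <- ".*?(?:ERROR-)?(.*?)\\s+for\\s+(?:[A-Z]*)?ID"

# regexec() reports the full match followed by each capturing group;
# ignore.case = TRUE lets "ID" in the pattern match "id" as well
m <- regmatches(msgs, regexec(pattern, msgs, ignore.case = TRUE, perl = TRUE))

# Element 1 is the whole match, element 2 is the single capturing group
captured <- vapply(m, function(x) x[2], character(1))
print(captured)
```

If a substitution is preferred instead, appending .* to the pattern and replacing with "\\1" in sub() should give the same strings, since the whole input is then matched and replaced by the captured part.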

Demo

CinCout
  • Ah, finally a regex with a non capture. Could you please explain its usage here? It's something I love but rarely see it in use hence a bit less conversant with it. – NelsonGon Jul 24 '19 at 07:02
  • Yes, just need a bit more explanation on the non capture part. Ah, it actually is already explained well. Thanks, sorry. – NelsonGon Jul 24 '19 at 07:04