
I have a dataset of species names in which some of the names originally used are now obsolete. Those cells are noted "old_species***retired***use new_species", whereas correct cells are just noted "new_species". Here is a sample of the data:

df<- data.frame(species=c("Etheostoma spectabile","Ictalurus furcatus","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus","Ictalurus punctatus","Ictalurus punctatus","Micropterus salmoides","Etheostoma olmstedi","Noturus insignis","Lepomis auritus","Lepomis auritus","Nocomis leptocephalus","Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes","Lepomis cyanellus","Notropis chlorocephalus","Scartomyzon cervinus***retired***use Moxostoma cervinum","Ictalurus punctatus","Lythrurus ardens","Moxostoma pappillosum","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus"))

I have tried

sapply(strsplit(df$species, split = '***retired***use', fixed = TRUE), function(x) x[2])

but the cells for which the data is correct return NA because they do not contain the split string. Is there a way to apply the split only to the cells that actually contain it?

christophe

2 Answers


You can replace the old names with the new names using gsub with a backreference:

gsub(".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)", "\\1", df$species)

# [1] "Etheostoma spectabile"   "Ictalurus furcatus"      "Micropterus salmoides"   "Micropterus salmoides"  
# [5] "Ictalurus punctatus"     "Ictalurus punctatus"     "Ictalurus punctatus"     "Micropterus salmoides"  
# [9] "Etheostoma olmstedi"     "Noturus insignis"        "Lepomis auritus"         "Lepomis auritus"        
# [13] "Nocomis leptocephalus"   "Moxostoma rupiscartes"   "Lepomis cyanellus"       "Notropis chlorocephalus"
# [17] "Moxostoma cervinum"      "Ictalurus punctatus"     "Lythrurus ardens"        "Moxostoma pappillosum"  
# [21] "Micropterus salmoides"   "Micropterus salmoides"   "Ictalurus punctatus" 

Explanation:

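.* matches any character, zero or more times, followed by ...

\\*\\*\\*retired\\*\\*\\*use\\s ... the literal text ***retired***use and one whitespace character, followed by ...

(.*) ... any character, zero or more times. This is the capturing group that the backreference \\1 in the replacement argument of gsub refers to. Because the pattern covers the whole string, the replacement leaves only the new name; strings without the marker do not match and are returned unchanged.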
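To see the match behaviour concretely, here is a minimal sketch on two values taken from the sample data. The names x, pattern, matched and cleaned are only illustrative and do not appear in the question.

```r
# One retired entry and one current entry from the sample data
x <- c("Scartomyzon cervinus***retired***use Moxostoma cervinum",
       "Lepomis auritus")

pattern <- ".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)"

# grepl() reports which elements the pattern matches at all
matched <- grepl(pattern, x)

# Where the pattern matches, the whole string is replaced by group 1;
# where it does not match, gsub() returns the input unchanged
cleaned <- gsub(pattern, "\\1", x)

print(matched)
print(cleaned)
```

Only the first element matches, so only that element is rewritten; this is why the correct cells no longer turn into NA.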

Chris Ruehlemann
  • Worked flawlessly. Just to clarify: in this case, must the double backslash (\\) precede every character that could also be an operator? – christophe Apr 30 '20 at 12:33
  • Yeah, in the R 'dialect', if you will, metacharacters such as `*`, which have a special meaning in regex (`*` is a quantifier meaning "zero or more times"), must be escaped with a double backslash to be matched literally, for example when you want an asterisk to match an asterisk. Glad the code worked ;) – Chris Ruehlemann Apr 30 '20 at 13:21
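As a sketch of the escaping point raised in these comments: the double backslash is one of several ways to match a literal asterisk. The example below assumes R 4.0.0 or later for the raw-string form, and the variable names are illustrative.

```r
x <- "Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes"

# "\\*" in R source code is the two-character regex \*, a literal asterisk
res_escaped <- sub(".*\\*\\*\\*retired\\*\\*\\*use ", "", x)

# Inside a character class, * has no special meaning, so no backslash is needed
res_class <- sub(".*[*]{3}retired[*]{3}use ", "", x)

# Raw strings (assumes R >= 4.0.0) allow single backslashes in the pattern
res_raw <- sub(r"(.*\*\*\*retired\*\*\*use )", "", x)

print(c(res_escaped, res_class, res_raw))
```

With fixed = TRUE no escaping is needed at all, but then `.*` is also taken literally, so that option suits strsplit or grepl rather than this sub call.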
0

We can create an index with grep and then split only the rows that contain the marker:

i1 <- grep('retired', df$species)
df$species <- as.character(df$species)
df$species[i1] <- sapply(strsplit(df$species[i1], "***retired***use ", 
                fixed = TRUE), `[`, 2)

df$species
#[1] "Etheostoma spectabile"   "Ictalurus furcatus"      "Micropterus salmoides"   "Micropterus salmoides"   "Ictalurus punctatus"    
#[6] "Ictalurus punctatus"     "Ictalurus punctatus"     "Micropterus salmoides"   "Etheostoma olmstedi"     "Noturus insignis"       
#[11] "Lepomis auritus"         "Lepomis auritus"         "Nocomis leptocephalus"   "Moxostoma rupiscartes"   "Lepomis cyanellus"      
#[16] "Notropis chlorocephalus" "Moxostoma cervinum"      "Ictalurus punctatus"     "Lythrurus ardens"        "Moxostoma pappillosum"  
#[21] "Micropterus salmoides"   "Micropterus salmoides"   "Ictalurus punctatus"    

Or use a regex with sub, which leaves strings without the marker unchanged:

df$species <- sub(".*\\*{3}retired\\*{3}use\\s+", "", df$species)
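Another option, sketched here as an assumption rather than as part of the answer above, keeps the asker's strsplit call and takes the last piece of each split instead of the second. Strings without the marker come back as a single piece, so the last piece is the original name. The names species, marker, pieces and new_names are illustrative.

```r
species <- c("Lepomis auritus",
             "Scartomyzon cervinus***retired***use Moxostoma cervinum",
             "Ictalurus punctatus")

marker <- "***retired***use "

# fixed = TRUE treats the marker as plain text, so the asterisks need no escaping
pieces <- strsplit(species, marker, fixed = TRUE)

# Unmarked names yield one piece, marked names yield two;
# keeping the last piece works in both cases
new_names <- vapply(pieces, function(p) p[length(p)], character(1))

print(new_names)
```

Empty strings and NA values are not covered by this sketch. If the column is a factor, convert it with as.character first, as in the grep version.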
akrun