I have a data frame that I'm using called "fish".

The data frame has 3 different variables. One of the variables is called "species".

There are some species that start with the letter M. I want to change all the values of species that start with the letter M to be missing (NA) instead.

I know how to change it to NA when you are doing the whole species name, but how do you do it for just species that START with the letter M?

I've tried this:

fish$species[fish$species=="^M_"] <- NA

But this doesn't work. Can anyone help?

newtoallthis
  • To test for matching a pattern, you'll need the `grepl` function, not `==`. – Frank Dec 09 '16 at 18:46
  • Thanks, I have seen stuff out there using gsub and grep. But can you help me with the code? Do I literally replace the == with "grep1"? – newtoallthis Dec 09 '16 at 18:47
  • Ah, I forgot that R had added the `startsWith` function (in the answer below), but the use of grepl is covered in the docs at `?grepl`. You'd do something like `x[ grepl(patt, x) ] <- y`, generally. `grep` can also be used here, thanks to R's multiple ways of indexing a vector (by logical or by position number, covered in any R intro tutorial). – Frank Dec 09 '16 at 18:49
  • Not to be a total dummy, but I don't really understand any of your comment. I'm pretty new to R, I've only been learning it for about a month. In your code, what is the x and patt? – newtoallthis Dec 09 '16 at 18:55
  • You can reach the documentation for a function by typing `?` then the function's name. The style `x[ w ] <- y` that you're using here can work with `w` coming from `grep` or from `grepl`. Not sure if that covers it. – Frank Dec 09 '16 at 18:57
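
To make the `x[ grepl(patt, x) ] <- y` idiom from the comments concrete, here is a minimal sketch. The `fish` data frame below is made-up toy data (the question does not show the real one), and `species` is assumed to be a character column. The original attempt fails because `==` compares each value to the literal string `"^M_"`; it does not interpret it as a pattern.

```r
# Toy stand-in for the asker's data frame (hypothetical values)
fish <- data.frame(
  species = c("Mackerel", "Salmon", "Marlin", "Tuna"),
  length  = c(30, 70, 250, 120),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector: TRUE where the pattern matches.
# "^M" is a regular expression meaning "starts with a capital M".
starts_with_m <- grepl("^M", fish$species)

# Logical indexing: assign NA wherever the pattern matched
fish$species[starts_with_m] <- NA

# grep() returns the matching positions instead; indexing by position
# gives the same result on a plain character vector
other_species <- c("Mackerel", "Salmon", "Marlin", "Tuna")
other_species[grep("^M", other_species)] <- NA
```

Here `x` is the vector being modified (`fish$species`), `patt` is the pattern (`"^M"`), and `y` is the replacement value (`NA`).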

1 Answer

You could use the replacement function is.na<-() along with startsWith().

is.na(fish$species) <- startsWith(fish$species, "M")

According to the R documentation help(startsWith),

startsWith() is equivalent to but much faster than grepl("^<prefix>", x), where prefix is not to contain special regular expression characters.

The code above assumes a character column. For a factor column, you can change the appropriate levels.

is.na(levels(fish$species)) <- startsWith(levels(fish$species), "M")

Another way would be to replace with levels<-(), using NA for the replacement on the right-hand side.

levels(fish$species)[startsWith(levels(fish$species), "M")] <- NA

Note that you can definitely use grepl() if you'd like, but this question seems like a good example use of the new startsWith() function.

Also note that all these were successfully tested on the iris data set.
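
As a small self-contained illustration of the factor case, the sketch below applies the `levels<-` approach to a made-up factor (the values are hypothetical, not the asker's data). Under that assumption, the matching levels are dropped and the corresponding entries become NA, while the vector stays a factor.

```r
# Hypothetical factor standing in for fish$species
species <- factor(c("Mackerel", "Salmon", "Marlin", "Tuna", "Salmon"))

# Find the levels that start with "M" (levels are always character)
m_levels <- startsWith(levels(species), "M")

# Setting those levels to NA removes them and turns the
# matching entries of the factor into NA
levels(species)[m_levels] <- NA

# Inspect the result: still a factor, with only the remaining levels
print(species)
print(levels(species))
```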

Rich Scriven
  • It worked, it worked!!!!!!! So in order to do this, I had to change the variable from a factor to a character. It's not a huge deal, but is there any way to keep it as a factor? It gives me an error (non-character object) if I leave as factor. – newtoallthis Dec 09 '16 at 18:51
  • Should be able to do it with `is.na(fish$species) <- startsWith(as.character(fish$species), "M")`. That will not change the vector to character. – IRTFM Dec 09 '16 at 18:53
  • @newtoallthis - I just noticed that too. Made an edit. – Rich Scriven Dec 09 '16 at 18:56
  • There was a time that I tried using `levels<-`, but I found it too dangerous for general practice at least in my hands. – IRTFM Dec 09 '16 at 19:00
  • Not sure that this is any better than `grepl` anyhow, but this question seemed like a good example of when to use the new `startsWith` function. – Rich Scriven Dec 09 '16 at 19:00
  • Agree. `startsWith` was new to me. The `levels<-` operation seems to succeed with: `f <- factor(letters[1:10]); levels(f)[grepl("[a-c]", levels(f))] <- NA`. – IRTFM Dec 09 '16 at 19:04
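
One difference between the two factor approaches discussed in this thread may be worth seeing side by side. The sketch below uses a made-up factor (hypothetical values): the `as.character()` variant from the comments sets the entries to NA but leaves the unused "M" levels in place, and `droplevels()` can remove them afterwards if that is wanted.

```r
# Hypothetical factor standing in for fish$species
species <- factor(c("Mackerel", "Salmon", "Marlin", "Tuna"))

# Convert to character only for the test; the factor itself is not converted
kept <- species
is.na(kept) <- startsWith(as.character(kept), "M")

# The entries are now NA, but all four original levels are still recorded
print(levels(kept))

# Optionally drop the levels that no longer occur in the data
dropped <- droplevels(kept)
print(levels(dropped))
```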