I took the iris data.frame and filtered out the "setosa" rows from Species.

When I call tapply(), the summary still lists all 3 species that were originally in that column. Why does it show setosa as NA? It shouldn't know about setosa!

    library(dplyr)
    a <- filter(iris, Species != "setosa")

    tapply(a$Sepal.Length, a$Species, mean)

Result:

    tapply(a$Sepal.Length, a$Species, mean)   
 #  setosa versicolor  virginica    
 #      NA      5.936      6.588

What am I missing?

LocoGris
  • Most likely it's because Species is a factor. Even if you remove all cases of one factor level, the factor variable still knows it has 3 different levels. droplevels() should hopefully solve this. – TinglTanglBob Mar 28 '19 at 13:34
  • @TinglTanglBob is right; this would get rid of it: `tapply(a$Sepal.Length, as.character(a$Species), mean)` – Mike Mar 28 '19 at 13:34
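
What the comments describe can be checked by inspecting the factor directly. The following is a minimal sketch, assuming base R's `subset` as a stand-in for dplyr's `filter` so that it runs without extra packages; it is not the asker's exact code, and the name `res` is only illustrative.

```r
# Base R stand-in for filter(iris, Species != "setosa") (assumption: same rows are kept)
a <- subset(iris, Species != "setosa")

levels(a$Species)   # all three levels are still recorded on the factor
table(a$Species)    # setosa is still listed, with a count of 0

# tapply() creates one cell per factor level; the empty setosa cell is filled with NA
res <- tapply(a$Sepal.Length, a$Species, mean)
print(res)
```

The filtering removes rows, not levels, which is why the empty group still shows up in the result.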

1 Answer

That's because in your filtered data frame, the column Species is still a factor with 3 levels, even though only 2 of them occur in the column. You can use droplevels() to drop the unused levels:

    library(dplyr)
    a <- droplevels(filter(iris, Species != "setosa"))
    tapply(a$Sepal.Length, a$Species, mean)
    # versicolor  virginica
    #      5.936      6.588
Stéphane Laurent
  • No need for dplyr, use `subset` or `[`. – zx8754 Mar 28 '19 at 13:50
  • Given the OP's approach, I think using dplyr here is the best way to keep the answer as close as possible to the question. I fear other subsetting methods might cause unintended confusion. – TinglTanglBob Mar 28 '19 at 13:57
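
For readers who prefer the base R route mentioned in the comments, here is a hedged sketch using `[` for the subsetting. The variable names (`b`, `b1`, `b2`, `res1`, `res2`) are illustrative only, and the second option (calling `factor()` again on the column) is a common alternative that is not taken from the answer above.

```r
# Subset with `[` (no dplyr); the unused "setosa" level is still attached here
b <- iris[iris$Species != "setosa", ]

# Option 1: drop unused levels from every factor column of the data frame
b1 <- droplevels(b)

# Option 2: rebuild only the Species factor, which keeps just the levels in use
b2 <- b
b2$Species <- factor(b2$Species)

res1 <- tapply(b1$Sepal.Length, b1$Species, mean)
res2 <- tapply(b2$Sepal.Length, b2$Species, mean)
print(res1)
```

Both options should give a two-group summary; `droplevels()` is the shorter choice when a data frame has several factor columns.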