2

I'm new to R and I was wondering if there is an opposite code of 'which' in R?

So e.g. when I run the code below, then it keeps all the data between 10 and 50 and removes everything else. The code below works for me, there is no problem there.

data <- data[which(data$age>10 & data$age<50),]

But I want to know if there is a code that can do the opposite? Meaning --> I want to remove specific rows from the data, so instead of having a code that says what to keep I want a code that indicates what to remove. If that makes sense? I want to remove a specific row by condition.

I have tried with the subset code, but I can't get it to work. The below code is the code I tried that didn't work

data2 <- subset(data1, data1$gender=='male')

So gender is a column, with females and males. And I want a code to remove the males only.

SAphi11
  • 29
  • 1
  • 7

1 Answers1

5

Although this should be a comment, since you said you are new to R let me take a bit more space to explain this in a non-technical way as it is clear there is some confusion.

First, when you index in R using brackets (ie, df[x,y]), the x part (before the comma) looks at rows, and the y part looks at columns. Your question title asks about removing columns, but your question asks about removing rows. So I will go through both.

Say you have these data (note some have missing (NA) values):

set.seed(123)
n <- 10
df <- data.frame(Age = sample(c(1:100, NA), n, replace = TRUE),
                 Gender = sample(c("Male", "Female", NA), n, replace = TRUE),
                 Cofactor = rep(LETTERS, length.out = n),
                 Cofactor2 = sample(c("Yes", "No", "Maybe", NA), n, replace = TRUE),
                 Cofactor3 = runif(n))

#    Age Gender Cofactor Cofactor2  Cofactor3
# 1   31 Female        A       Yes 0.02461368
# 2   79   Male        B     Maybe 0.47779597
# 3   51 Female        C      <NA> 0.75845954
# 4   14   <NA>        D        No 0.21640794
# 5   67   Male        E     Maybe 0.31818101
# 6   42   <NA>        F        No 0.23162579
# 7   50   <NA>        G       Yes 0.14280002
# 8   43   Male        H        No 0.41454634
# 9   NA   Male        I     Maybe 0.41372433
# 10  14   Male        J      <NA> 0.36884545

Dropping Rows

You can index rows by row position using numbers - i.e. if you want to keep or drop the first three rows:

# keep 
df[1:3, ]

# drop
df[-c(1:3),]

Notice commands are in the x indexing position (left of the comma). If you wanted to drop the observations (rows) that were male, you could do it several ways. For instance:

df[!(df$Gender %in% "Male"),]

# or using `which()`
df[-(which(df$Gender %in% "Male")),]

#   Age Gender Cofactor1  Cofactor2
# 1  31 Female       Yes 0.02461368
# 3  51 Female      <NA> 0.75845954
# 4  14   <NA>        No 0.21640794
# 6  42   <NA>        No 0.23162579
# 7  50   <NA>       Yes 0.14280002

The ! means "not" - so this reads, "select rows that are not male" - including NA values.

If you did this:

df[df$Gender %in% "Female",]

# or 
df[which(df$Gender %in% "Female"),]

#   Age Gender Cofactor1  Cofactor2
# 1  31 Female       Yes 0.02461368
# 3  51 Female      <NA> 0.75845954

That would read "include all where gender is female" - notice NA != female so they are not included.

Similarly, if you wanted to include both "yes" and "maybe" in Cofactor1:

df[df$Cofactor1 %in% c("Yes", "Maybe"),]

#   Age Gender Cofactor1  Cofactor2
# 1  31 Female       Yes 0.02461368
# 2  79   Male     Maybe 0.47779597
# 5  67   Male     Maybe 0.31818101
# 7  50   <NA>       Yes 0.14280002
# 9  NA   Male     Maybe 0.41372433

Note that I am using %in%, not ==, this is because of vector recycling - see what happens when I use == (hint, it gives unwanted results):

df[df$Cofactor1 == c("Yes", "Maybe"),]

#     Age Gender Cofactor1  Cofactor2
#1     31 Female       Yes 0.02461368
#2     79   Male     Maybe 0.47779597
#NA    NA   <NA>      <NA>         NA
#7     50   <NA>       Yes 0.14280002
#NA.1  NA   <NA>      <NA>         NA

The correct way to use == is much more verbose (df[(df$Cofactor1 == "Yes"| df$Cofactor1 == "Maybe") & !is.na(df$Cofactor1),] so using %in% is a good option here.

Keeping/Dropping Columns

Indexing columns is on the y position of indexing (to the right of the comma). If your data have a large number of unneeded columns, you can simply choose the ones you want to keep by indexing by name (or column number:

df[,c("Age", "Gender")]

# or df[, 1:2]

#    Age Gender
# 1   31 Female
# 2   79   Male
# 3   51 Female
# 4   14   <NA>
# 5   67   Male
# 6   42   <NA>
# 7   50   <NA>
# 8   43   Male
# 9   NA   Male
# 10  14   Male

But you can only drop columns by number (I know, quirky) - so you cant drop by df[,-c("Age", "Gender")] but you can drop by df[,-c(1:2)]

In my work it is preferred to drop by name since columns get shifted around a bit - so with names I know exactly what I am dropping. One workaround I use is to use grep with names(df) to identify the positions of the columns that I want to drop.

This is a little tricky so be careful. If I want to drop all columns that start with "Cofactor" in the name:

dropcols <- grep("Cofactor", names(df))

# or to ignore case
# grep("Cofactor", names(df), ignore.case = TRUE)
# [1] 3 4 5

If I only wanted to drop Cofactor but keep Cofactor1 and Cofactor2, I could use \\b to put a word boundary on it:

dropcols <- grep("\\bCofactor\\b", names(df))
[1] 3

So to drop the columns, you can simply index like so:

dropcols <- grep("Cofactor", names(df))
df[, -dropcols]

#    Age Gender
# 1   31 Female
# 2   79   Male
# 3   51 Female
# 4   14   <NA>
# 5   67   Male
# 6   42   <NA>
# 7   50   <NA>
# 8   43   Male
# 9   NA   Male
# 10  14   Male
jpsmith
  • 11,023
  • 5
  • 15
  • 36