0

I have 2 different data frames of the following format:

DF1 -

  v1 v2 v3 v4 v5
a 1  2  +
b 5  2  +  +
c 5  2  +     +
d 4  3     +  +
e 1  5  +     +
f 3  5 
g 4  2  
h 3  1  
i 5  5  +     +

DF2 -

  v1 v2 v3 v4 
a 1  2  +
b 5  2  +  +
c 5  2  +     
d 4  3     +  
e 1  5  +     
f 3  5 
g 4  2  
h 3  1  
i 5  5  +     

My script gives a scatter plot of v1 & v2, but first I remove rows that have at least one "+" in v3-v4 or v3-v5.

My data frames can be bigger with more v1-v2 pairs, but always have either the v3-v4 or v3-v5 columns with "+". I adjust the code manually to specify columns to plot and which rows to remove depending on the DF format I am working on.

It works well but I wanted to make the script more interactive as follows:

# Select v3-v4 or v3-v5 via interactive gui to give vector of column headers.
remove.vars.vector <- select.list(names(DF), # Select columns as vector of column header names via interactive gui.
                           multiple = TRUE, # Can choose multiple columns.
                           title = "Choose variables to remove from data set", # Title on gui.
                           graphics = TRUE) # Allow launch of gui.

# Return columns from DF with this vector of column headers.
remove.vars.subset <- DF[remove.vars.vector]
# Return rows that have at least one "+" in v3-v4 or v3-v5.
remove.vars.subset.+ <- subset(DF, remove.vars.subset == "+")
# Removes all rows that contain >=1 NA.
complete.data.+ <- remove.vars.subset.+[complete.cases(remove.vars.subset.+), ] 
# Combine by rows "complete.data.+" with DF.
combo.list <- rbind(DF,complete.data.+)
# Remove duplicate rows from combined data frame.
complete.data <- combo.list[!duplicated(combo.list, fromLast = FALSE) & !duplicated(combo.list, fromLast = TRUE),]

Problem: The above code doesn't completely strip the data frame of rows that contain at least one "+" in v3-4 or v3-5. The problem appears to be these lines:

# Return rows that have at least one "+" in v3-v4 or v3-v5.
    remove.vars.subset.+ <- subset(DF, remove.vars.subset == "+")

I also get a number of rows at the end with only NA in every cell hence complete.cases in the next line of code.

The final data frame therefore still contains some rows with "+" in v3-4 or v3-5.

Question:

Is there a better way to subset rows in a data frame using a vector of column headers that may contain "+" in their rows?

Thank you in advance.

EDIT - 09/08/2016 - 18:54 I just noticed something that I didn't clarify about my data frames. Some of the rows don't have "+" in v3-v4 or v3-v5. These are the rows that I eventually want to keep so I can plot the scatter. I've edited data frames accordingly. I'm just looking at answers to try and understand them. I'm quite new to R still.

Alex M
  • 165
  • 1
  • 2
  • 15

2 Answers2

0

Suppose your data DF is

> DF
  v1 v2 v3 v4 v5
1  1  2  +      
2  5  2  +  +   
3  5  2  +     +
4  4  3     +  +
5  1  5  +     +

and I pick v3 and v4. Then remove.vars.subset, following your code, is

> remove.vars.subset
  v3 v4
1  +   
2  +  +
3  +   
4     +
5  +   

and notice that remove.vars.subset == "+" evaluates as

> remove.vars.subset == "+"
        v3    v4
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,]  TRUE FALSE
[4,] FALSE  TRUE
[5,]  TRUE FALSE

What subset is then doing is to ask R to return rows from the dataframe where the condition evaluates to TRUE, i.e.:

DF[c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),]

concatenating the first and second columns. But there are only 5 rows in the dataframe, whereas there are 10 elements in the logical vector. Since DF only has 5 rows, NAs are appended (so it's as if DF has 10 rows) and the expression is evaluated accordingly. So you see:

> subset(DF, remove.vars.subset == "+")
     v1 v2   v3   v4   v5
1     1  2    +          
2     5  2    +    +     
3     5  2    +         +
5     1  5    +         +
NA   NA NA <NA> <NA> <NA>
NA.1 NA NA <NA> <NA> <NA>

What you might want to try is

DF[!apply(remove.vars.subset, MAR=1, function(x) any(x=="+")), ]
> DF[!apply(remove.vars.subset, MAR=1, function(x) any(x=="+")), ]
[1] v1 v2 v3 v4 v5
<0 rows> (or 0-length row.names)

which returns no rows because all the rows (given the choice of v3 and v4) have at least one "+" in them. But suppose we chose v4 and v5:

> DF[!apply(remove.vars.subset, MAR=1, function(x) any(x=="+")), ]
  v1 v2 v3 v4 v5
1  1  2  +   
Weihuang Wong
  • 12,868
  • 2
  • 27
  • 48
  • Just tried your suggestion and it worked perfectly. You correctly surmised that some rows wouldn't have "+" (I edited my post before I'd tried your suggestion). Apply family of functions has moved to the top of my list for things to explore in R. Thank you so much for the quick response. – Alex M Aug 09 '16 at 18:37
  • PS: I verified by row binding my original method data frame to the one produced with your amendment, and then removed duplicate rows. Answer was data frame with no rows....awesome. My brain is slowly taking in R programming logic! – Alex M Aug 09 '16 at 18:45
0

I have a solution where you don't choose the column but all rows with "+" and NAs are removed from the dataframe. I don't know if that helps. It is based on the following question: Better way to filter a data frame with dplyr using OR?

v1 <- c(1,2,3,4,5,NA)
v2 <- c(1,2,3,4,5,NA)
v3 <- c("","+","+","","",NA)
v4 <- c("","+","","+","",NA)
v5 <- c("","+","","","",NA)

D1 <- cbind.data.frame(v1,v2,v3,v4,v5,stringsAsFactors=F)

library(dplyr)

remove.vars.vector <- c("v3","v4","v5")
condition <- c("+",NA)

D1 %>%
  filter(rowSums(sapply(D1, FUN = "%in%", condition)) == 0) -> D1_new

EDIT: I found a possibility to choose the columns, unfortunately I didn't find a solution to select the columns by character vector:

D1 %>% select_(remove.vars.vector) -> D1_sub # NOT working

D1 %>% select(v3:v5) -> D1_sub # working
D1 %>% select(v3,v4,v5) -> D1_sub # working
D1 %>% select_("v3","v4","v5") -> D1_sub # working

D1 %>%
  filter(rowSums(sapply(D1_sub, FUN = "%in%", condition)) == 0) -> D1_new
Community
  • 1
  • 1
PhiSeu
  • 301
  • 2
  • 9
  • Thank you for your suggestion. I went round in circles for a while trying to subset by a vector of column headers. http://stackoverflow.com/users/6455166/weihuang-wong did the trick – Alex M Aug 09 '16 at 18:40