Subsetting with multiple conditions in very large data set

Question

I have a matrix that is approximately 430 x 20,000. Each row is a person and each column is a project they have worked on. Each cell holds one of three values: 0 (not involved), 1 (project head, only one per project), or 2 (project helper). I want to look at only the projects that a single person was the head of, one person at a time. So for Person A, I need R to drop all columns where that person's value isn't 1, while retaining all the data about other individuals in the remaining columns.

Ex:

 Name   Project 1   Project 2......Project 2,000
Person A      1            0                    2
Person B      0            1                    1
Person C      2            2                    2

I am trying to get something for Person B that drops columns they didn't head.

 Name    Project 2......   Project 2,000
Person A      0                    2
Person B      1                    1
Person C      2                    2

Sorry if this is obvious. For some reason I have really struggled to find examples for data this large (i.e. I can't just type in the column names because there are too many). Any help would be greatly appreciated.

This isn't a very large dataset (by my standards) and some extra optimization probably won't be needed. Please provide a reproducible example with what the desired result would look like. — Roman Luštrik, Jul 05 '15 at 13:38
If you are looking for each person, then you may need a loop — akrun, Jul 05 '15 at 13:41
Assuming your data is called `df` and the first column is Name and all the other columns are projects, this should do the job for one person, e.g. "Person B": `df_B = df[, df[2,] == 1)]`. If you need this for more persons make a loop as akrun suggested and store your output in a `list` — mts, Jul 05 '15 at 14:19
Hi Roman, apologies, I am not sure what you mean by a "reproducible example"....basically by the end I am hoping to have 430 separate matrices (one for every member, which includes only data for the projects in which that member was the head). — AgeTex, Jul 05 '15 at 15:26
mts: That was exactly what I needed! Thanks! For anyone else who stumbles across this: there is an unbalanced parenthesis in the code mts posted, but with that fixed it works great: `df_B <- df[ , (df[2, ] == 1) ]` — AgeTex, Jul 05 '15 at 15:37

score 2 · Answer 1 · answered Jul 05 '15 at 14:32

So what you are trying to do is just select columns of a dataframe based on the values in one of the rows. Using a dataframe similar to your example:

> df
#      Name Project1 Project2 Project2000
#1 Person A        1        0           2
#2 Person B        0        1           1
#3 Person C        2        2           2

In order to select the columns for, say, "Person B", you need a logical vector indicating which are the columns to keep, i.e. a vector which has length the same as the number of columns in your dataframe, and has the value TRUE for the columns to include in the result, and FALSE otherwise.

You can almost do this with:

> leadB <- df[2,]==1
#   Name Project1 Project2 Project2000
#2 FALSE    FALSE     TRUE        TRUE

which picks out the correct projects, but would drop the Name column; to also include that column, we use:

> leadB <- c(TRUE, df[2,-1]==1)
#[1]  TRUE FALSE  TRUE  TRUE

Then use this vector to select columns from the dataframe:

> df_B <- df[,leadB]
#      Name Project2 Project2000
#1 Person A        0           2
#2 Person B        1           1
#3 Person C        2           2

Of course, you can do this in a single line, and there is nothing special about the "Person B" row, so you could use a function which returns the desired dataframe for the person in row n:

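leader_df <- function(n){
    # Keep Name (column 1) plus every project where row n is coded 1.
    # drop = FALSE returns a data frame even if only Name is selected.
    df[, c(TRUE, df[n,-1]==1), drop = FALSE]
}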

Then evaluating leader_df(n) over values of n from 1 to the number of rows will give you the dataframes for each project leader.
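
As a minimal sketch of that last step, assuming the three-row example `df` from above (rebuilt here so the snippet runs on its own). The list name `leader_dfs` is made up for illustration:

```r
# Rebuild the small example data frame from this answer.
df <- data.frame(
  Name        = c("Person A", "Person B", "Person C"),
  Project1    = c(1, 0, 2),
  Project2    = c(0, 1, 2),
  Project2000 = c(2, 1, 2)
)

# Name column plus every project where row n is coded 1.
# drop = FALSE keeps a data frame even when no project matches.
leader_df <- function(n) {
  df[, c(TRUE, df[n, -1] == 1), drop = FALSE]
}

# One data frame per row, stored in a list named after each person.
leader_dfs <- lapply(seq_len(nrow(df)), leader_df)
names(leader_dfs) <- as.character(df$Name)

leader_dfs[["Person B"]]
```

With the asker's data this would produce a list of about 430 data frames, one per person. People who head no project get a data frame containing only the Name column.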

This can actually be even cleaner if `Name` is the `row.names` of your dataframe, rather than a column (so all the df values are numerical). Depending on your use case, this may or may not be desirable. — tegancp, Jul 05 '15 at 14:34
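
A rough sketch of that row-names variant, assuming the data are held in a numeric matrix (the question mentions a matrix) whose row names are the people. The object names `m` and `m_B` are hypothetical:

```r
# Hypothetical matrix version of the example: people as row names,
# projects as columns, all values numeric (0, 1 or 2).
m <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    2, 2, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Person A", "Person B", "Person C"),
                  c("Project1", "Project2", "Project2000"))
)

# Keep only the columns where Person B is coded 1 (project head).
# drop = FALSE keeps the result a matrix even if only one column matches.
m_B <- m[, m["Person B", ] == 1, drop = FALSE]
m_B
```

Because no Name column is in the way, the row comparison can be used directly as the column index, and rows can be picked by name instead of by position.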

score 1 · Answer 2 · answered Jul 05 '15 at 15:08

You can solve this by first finding the row that corresponds to the person under consideration. You can then find the columns for which this person is the project leader and extract those columns from the dataframe (including the person's name). An example is below:

Create the data:

> person = c("John", "Willy", "Bob", "Anna", "Tom","Billy") 
> project1 = c(1, 0, 2, 0, 0,2) 
> project2 = c(1, 2, 0, 2, 0,0) 
> project3 = c(2, 0, 1, 0, 2,0)
> project4 = c(0, 0, 0, 1, 2,0)
> projects <- data.frame(person,project1,project2,project3,project4)

> projects
  person project1 project2 project3 project4
1   John        1        1        2        0
2  Willy        0        2        0        0
3    Bob        2        0        1        0
4   Anna        0        2        0        1
5    Tom        0        0        2        2
6  Billy        2        0        0        0

Obtain the relevant information for John. Note that we need to explicitly add the column with the person names:

> findPerson = "John"
> rowIndex <- which(projects$person==findPerson)
> columnIndex <- c(1,which(projects[rowIndex,]==1))
> if(length(columnIndex) > 1) # Only generate the table if the person leads at least one project
+   result <- projects[,columnIndex]

> result
  person project1 project2
1   John        1        1
2  Willy        0        2
3    Bob        2        0
4   Anna        0        2
5    Tom        0        0
6  Billy        2        0
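
The same steps could be wrapped in a small helper so they can be reused for any person. This is only a sketch: the function name `projects_headed_by` is invented here, and returning `NULL` for someone who heads no project is one possible convention, not something the answer prescribes.

```r
# Same example data as above, rebuilt so the snippet is self-contained.
projects <- data.frame(
  person   = c("John", "Willy", "Bob", "Anna", "Tom", "Billy"),
  project1 = c(1, 0, 2, 0, 0, 2),
  project2 = c(1, 2, 0, 2, 0, 0),
  project3 = c(2, 0, 1, 0, 2, 0),
  project4 = c(0, 0, 0, 1, 2, 0)
)

# Hypothetical helper: returns NULL when the person heads no project.
projects_headed_by <- function(data, name) {
  rowIndex <- which(data$person == name)
  # Search only the project columns, then shift by 1 to skip `person`.
  headed <- which(unlist(data[rowIndex, -1]) == 1) + 1
  if (length(headed) == 0) return(NULL)
  data[, c(1, headed), drop = FALSE]
}

projects_headed_by(projects, "John")   # columns: person, project1, project2
projects_headed_by(projects, "Willy")  # NULL, Willy heads nothing
```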

score 0 · Accepted Answer · answered Jul 05 '15 at 15:42

Assuming your data is called df and the first column is Name and all the other columns are projects, this should do the job for one person, e.g. "Person B":

df_B = df[, (df[2,] == 1)]

If you need this for more persons put it in a loop as akrun suggested and store your output in a list.
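
A minimal sketch of such a loop, assuming the same layout (Name in the first column, projects in the rest) and using the question's small example as stand-in data. The list name `results` is made up. Note that the one-liner above also compares the Name column against 1, which is never true, so it drops that column; this sketch keeps Name instead.

```r
# Stand-in data with the layout described in the question.
df <- data.frame(
  Name        = c("Person A", "Person B", "Person C"),
  Project1    = c(1, 0, 2),
  Project2    = c(0, 1, 2),
  Project2000 = c(2, 1, 2)
)

# Pre-allocate one list slot per person and label the slots.
results <- vector("list", nrow(df))
names(results) <- as.character(df$Name)

for (i in seq_len(nrow(df))) {
  # Logical index over the project columns only (column 1 is Name).
  heads <- unlist(df[i, -1]) == 1
  # Keep Name plus the projects this person heads.
  results[[i]] <- df[, c(TRUE, heads), drop = FALSE]
}

results[["Person B"]]
```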
