Extract intervals inside groups in a data frame, using information from another data frame

Question

As I said in the title, my goal is to extract intervals from subsets of my data frame using information from another data frame.

My input:

df1:

  subject         x      y
7G001-0024-10   0,00    15
7G001-0024-10   97,29   18
7G001-0024-10   197,34  21
7G001-0024-10   314,66  22
7G001-0024-10   482,77  25
7G001-0030-10   0,00    12
7G001-0030-10   99,50   16
7G001-0030-10   184,37  20
7G001-0030-10   301,89  25
7G001-0030-10   585,67  27
     ...         ...   ...

df2 :

    subject   Threshold 
7G001-0024-10   177,08
7G001-0030-10   385,13
    ...          ...

For each subject, I would like to extract from df1 the x and y data between 0 and that subject's threshold value in df2, to get an output along these lines:

  subject         x      y
7G001-0024-10   0,00    15
7G001-0024-10   97,29   18
7G001-0030-10   0,00    12
7G001-0030-10   99,50   16
7G001-0030-10   184,37  20
7G001-0030-10   301,89  25
    ...          ...   ...

My first idea was to use which() inside the ddply function:

break=ddply(df1,.(subject),summarize,fun=x[which(x>=0 & x<Threshold )])

But I am stuck: I don't see how to make the threshold (from df2) change per subject inside the which() call.

Can anybody tell me how to deal with this (using my first idea or not)?

Sorry for the poor English.

Sven Hohenstein · Accepted Answer · 2012-09-06T10:19:24.407

First, your data:

df1 <- read.table(text = "subject         x      y
7G001-0024-10   0,00    15
7G001-0024-10   97,29   18
7G001-0024-10   197,34  21
7G001-0024-10   314,66  22
7G001-0024-10   482,77  25
7G001-0030-10   0,00    12
7G001-0030-10   99,50   16
7G001-0030-10   184,37  20
7G001-0030-10   301,89  25
7G001-0030-10   585,67  27", header = TRUE, dec = ",")

df2 <- read.table(text = "subject   Threshold 
7G001-0024-10   177,08
7G001-0030-10   385,13", header = TRUE, dec = ",")

You can use a simple apply call to solve the task:

do.call("rbind", apply(df2, 1, FUN = function(a) {df1[a[1] == df1$subject & df1$x >= 0 & df1$x <= as.numeric(a[2]), ]}))

#         subject      x  y
# 1 7G001-0024-10   0.00 15
# 2 7G001-0024-10  97.29 18
# 6 7G001-0030-10   0.00 12
# 7 7G001-0030-10  99.50 16
# 8 7G001-0030-10 184.37 20
# 9 7G001-0030-10 301.89 25
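As an aside, the same result can probably also be obtained by joining the thresholds onto df1 and filtering. The following is only a sketch of that alternative, not the approach used in this answer; the names merged and result are made up for illustration, and the data frames are rebuilt with data.frame() so the block runs on its own.

```r
# Same data as above, rebuilt so that this block is self-contained
df1 <- data.frame(
  subject = rep(c("7G001-0024-10", "7G001-0030-10"), each = 5),
  x = c(0, 97.29, 197.34, 314.66, 482.77, 0, 99.50, 184.37, 301.89, 585.67),
  y = c(15, 18, 21, 22, 25, 12, 16, 20, 25, 27)
)
df2 <- data.frame(
  subject = c("7G001-0024-10", "7G001-0030-10"),
  Threshold = c(177.08, 385.13)
)

# Attach each subject's threshold to its rows in df1
merged <- merge(df1, df2, by = "subject")

# Keep rows with 0 <= x <= Threshold and drop the helper column
result <- merged[merged$x >= 0 & merged$x <= merged$Threshold,
                 c("subject", "x", "y")]
```

Here the threshold stays numeric, so no as.numeric conversion is needed. Note that merge may reorder rows, so the row names can differ from the apply output.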

How does it work?

First, apply(df2, 1, FUN) applies a function to each row of the data frame df2. The value 1 means that the function is applied over the first dimension of the object, i.e. the rows (2 would mean the columns).

Have a look at a simple function that just returns its argument, i.e. each row of df2 in turn. Note that in the output, the rows of df2 are arranged as columns.
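To make the meaning of the second argument concrete, here is a minimal sketch on a small made-up matrix (m is not part of the question's data):

```r
# A hypothetical 2 x 3 matrix, used only to illustrate the margin argument
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # margin 1: one result per row    -> 9 12
col_sums <- apply(m, 2, sum)  # margin 2: one result per column -> 3 7 11
```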

> apply(df2, 1, FUN = function(a) a)
          [,1]            [,2]           
subject   "7G001-0024-10" "7G001-0030-10"
Threshold "177.08"        "385.13"

Since we want to extract a subset of df1, a more complex function is needed. So, I specified:

FUN = function(a) {df1[a[1] == df1$subject & df1$x >= 0 & df1$x <= as.numeric(a[2]), ]}

In this function, a represents a row of the data frame df2. The function is applied twice, once for each row of df2. a[1] is the subject ID, a[2] is the corresponding threshold. The function extracts a subset of the rows of df1 by three criteria:

The subjects are identical (a[1] == df1$subject)
The x value is at least zero (df1$x >= 0)
The x value is not higher than the threshold (df1$x <= as.numeric(a[2]))

Note: The value a[2] needs to be converted to a number with as.numeric. This is necessary because apply first coerces df2 to a matrix, and a matrix can hold only one type. Since the subject ID is not numeric, the whole matrix (including the threshold values) becomes character.
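The coercion described in the note can be checked directly. This is a small sketch; df2 is rebuilt with data.frame() so that the block is self-contained:

```r
df2 <- data.frame(
  subject = c("7G001-0024-10", "7G001-0030-10"),
  Threshold = c(177.08, 385.13)
)

# apply() works on the matrix version of the data frame
m <- as.matrix(df2)

is.character(m)                # TRUE: the numbers became strings
as.numeric(m[1, "Threshold"])  # 177.08: converted back to a number
```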

Each of these criteria returns a logical vector. These vectors are combined with & into a single logical vector indicating whether all three criteria are fulfilled. With df1[logical.vector, ], all rows of df1 where the logical vector is TRUE are selected. Since nothing is specified after the comma, all columns are selected.

The rows of df1 for which all three criteria are fulfilled are returned by the function, and apply collects these results.
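The three criteria can also be built one at a time for a single subject. The sketch below does this for the first subject; the intermediate names (same_subject, non_negative, within_limit, keep) are made up for illustration and do not appear in the one-liner above.

```r
df1 <- data.frame(
  subject = rep(c("7G001-0024-10", "7G001-0030-10"), each = 5),
  x = c(0, 97.29, 197.34, 314.66, 482.77, 0, 99.50, 184.37, 301.89, 585.67),
  y = c(15, 18, 21, 22, 25, 12, 16, 20, 25, 27)
)

# One logical vector per criterion, each as long as df1 has rows
same_subject <- df1$subject == "7G001-0024-10"
non_negative <- df1$x >= 0
within_limit <- df1$x <= 177.08

# TRUE only where all three criteria hold
keep <- same_subject & non_negative & within_limit

# Rows selected by the logical vector, all columns kept
first_subject <- df1[keep, ]
```

For this subject, keep is TRUE only for rows 1 and 2, which matches the first list element in the output shown next.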

> apply(df2, 1, FUN = function(a) {df1[a[1] == df1$subject & df1$x >= 0 & df1$x <= as.numeric(a[2]), ]})
[[1]]
        subject     x  y
1 7G001-0024-10  0.00 15
2 7G001-0024-10 97.29 18

[[2]]
        subject      x  y
6 7G001-0030-10   0.00 12
7 7G001-0030-10  99.50 16
8 7G001-0030-10 184.37 20
9 7G001-0030-10 301.89 25

The function apply returns a list of two data frames, one for each row of df2.

In the last step, the data frames in the list are combined into one data frame. The function do.call("rbind", list) calls rbind with the elements of the list as its arguments. For a list of length 2, this is equivalent to rbind(list[[1]], list[[2]]). In this way, both data frames in the list returned by apply are combined.
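The equivalence can be verified on a small made-up list (parts is hypothetical and only stands in for the list returned by apply):

```r
# Two small data frames with the same columns
parts <- list(
  data.frame(id = 1:2, value = c("a", "b")),
  data.frame(id = 3L, value = "c")
)

combined <- do.call("rbind", parts)        # rbind called with the list elements
by_hand  <- rbind(parts[[1]], parts[[2]])  # the same call written out

identical(combined, by_hand)  # TRUE
```

The advantage of do.call is that it works for a list of any length, so the code does not need to change when df2 has more subjects.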

Thanks a lot for your answer, it works fine! If it is not too much to ask, can you comment your code? Thanks in advance. — mat, Sep 06 '12 at 08:49
I updated my answer with an additional section ("How does it work?"). — Sven Hohenstein, Sep 06 '12 at 09:34
