0

I'am just learning R and I try to find a way to select rows from a data.frame

I would like to select for every ID one row for each group "pre" and post which has the smalles difference in abs(d_days), i.e have a similar interval pre-ref, ref-post. The minimum difference should be 1 year.

I am not looking for a maximum or minimum difference, but for the closest/nearest/ most similar difference > 1 year.

my test data.frame looks like:

"ID"      "date"      "d_days" "group"
"00377698" 2006-11-15 -1006   "pre"
"00377698" 2009-08-16 -1      "pre"
"00377698" 2009-08-17 0       "ref"
"00377698" 2009-08-24 7      "post"
"00377698" 2009-09-03 17     "post"
"00377698" 2009-10-09 53     "post"
"00377698" 2010-02-26 193    "post"
"00377698" 2010-08-27 375    "post"
"00377698" 2010-11-26 466    "post"
"00377698" 2011-08-24 737    "post"
"00540688" 2009-06-26 -1664  "pre"
"00540688" 2009-08-20 -1609  "pre"
"00540688" 2009-11-20 -1517  "pre"
"00540688" 2010-11-17 -1155  "pre"
"00540688" 2011-12-07 -770   "pre"
"00540688" 2014-01-09 -6     "pre"
"00540688" 2014-01-15 0      "ref"
"00540688" 2014-01-20 5      "post"
"00540688" 2014-03-05 49     "post"
"00540688" 2015-04-29 469    "post"
"00540688" 2015-09-30 623    "post"
"00540688" 2016-05-13 849    "post"

my attempts:

I tried somethinkg like data.frame %>% group_by(ID,group) %>% filter (group=="pre"| group=="post" & abs(d_days > 365)) %>% summarise(b = nth(abs(d_days[1]), which.max(abs(d_days[2]))))

I also tried roll=nearest R - merge dataframes on matching A, B and *closest* C?

I also tried this find value closest to x by group in dplyr, but it did not work out since I am not looking for values close to a specific value but just the "closest" of two sub-groups.

Unfortunately, I could not get what I am looking for:

"ID"      "date"      "d_days" "group"
"00377698" 2006-11-15 -1006   "pre"
"00377698" 2009-08-17 0       "ref"
"00377698" 2011-08-24 737    "post"
"00540688" 2011-12-07 -770   "pre"
"00540688" 2014-01-15 0      "ref"
"00540688" 2015-09-30 894    "post"

Many thanks for any help!

captcoma
  • 1,768
  • 13
  • 29
  • I'm a little confused about your requirements, are you only looking for "pre" and "post" groups that are >365 like in your attempts? I have something like `df %>% group_by(ID, group) %>% arrange(ID, date) %>% filter(abs(d_days) > 365) %>% mutate(range = as.Date(date)-lag(as.Date(date),1))` – Geochem B Jul 21 '17 at 21:33
  • @ Geochem, many thanks for your answer. I am looking for "pre" and "post" groups that are >365 and that have the closest time interval (e.g. from all "pre's" and "post's" with the ID 00540688 the closest interval are 770 and 894. – captcoma Jul 21 '17 at 21:43
  • Why is the "ref" in your example of the final solution? Are you looking for the closest value to "ref" since it is between the two ID's in this case? – Geochem B Jul 21 '17 at 21:52
  • I added the ref in my solution just to illustrate my data. It is not really needed. I am not looking for the closest value to "ref", I am only looking for "similar" time intervals between pre-ref and ref-post that are > 365 days. – captcoma Jul 21 '17 at 22:04

1 Answers1

0

Here is my best shot getting values close to 0 which would fufills half of the issue. Close but not complete.

df %>%
  filter(group != "ref",
         abs(d_days) > 365) %>% 
  group_by(ID, group) %>%
  arrange(ID, date) %>%
  filter(abs(d_days - 0) == min(abs(d_days - 0)))

      ID       date d_days  group
   <int>     <fctr>  <int> <fctr>
1 377698 2006-11-15  -1006    pre
2 377698 2010-08-27    375   post
3 540688 2011-12-07   -770    pre
4 540688 2015-04-29    469   post
Geochem B
  • 418
  • 3
  • 13