-1

I have a large dataset in R where each subject (Label: 1, 2, 3,...) gets scanned 2 or more times for fat mass, lean mass, etc. at several time points (Comments: PRE F1 BMR, POST F1 BMR,..). Some scans are erratic, so we can't just average all of them. I need a way to automatically select the rows with the two best (closest) measurements for one of the variables (Fat). Here's what the dataset looks like:

Example of dataset with multiple scans for subject 16 in POST F1 BMR

I've been trying to group the data by Label and Comments, but then is there a way to slice out the two rows within those groups where the Fat measurements are closest?

(P.S. Still a struggling R user and first time posting on StackOverflow, so forgive the layout!)

edited: Here's a simple test case with the intended result --

set.seed(2)
df <- data.frame(Fat=sample(1:10, 12, replace=T), 
             Lean=sample(1:5, 12, replace=T), 
             Label=rep(1:2, c(5,7)),
             Comments=rep(c("PRE BMR", "POST BMR", "PRE BMR", "POST 
             BMR"), c(2,3,2,5)))

dfresults<-df[-c(4,8,9,12),]
celow
  • 75
  • 1
  • 6
  • 1
    You need to give us a small example of data to work on. Eg `dput(head(data))` – Onyambu Jan 25 '18 at 17:29
  • This might work as a simplified version: df <- data.frame(Fat=sample(1:10, 10, replace=T), Lean=sample(1:5, 10, replace=T), Label=rep(1:2, c(5,5)), Comments=rep(c("PRE BMR", "POST BMR", "PRE BMR", "POST BMR"), c(2,3,3,2))) – celow Jan 25 '18 at 18:02
  • Can you please include this is your question with the output you expect from it – Onyambu Jan 25 '18 at 18:04

1 Answers1

1

I think this is probably what you're looking for:

  1. First group your labels
  2. Arrange by fat to sort ascending
  3. Make a new column that has the differences (diffvals1)
  4. Arrange by fat to sort descending.
  5. Make a new column that has the differences (diffvals2)
  6. Now filter by the min for each new column to find the two closest values.

library(tidyverse)
df %>%
  groupby(label) %>%
  arrange(fat) %>%
  mutate(diffvals1=ave(fat,FUN=abs(function(x) c(NA,diff(x))))) %>% 
  arrange(desc(fat)) %>% 
  mutate(diffvals2=ave(fat,FUN=abs(function(x) c(NA,diff(x))))) %>%
  drop_na() %>% 
  filter(diffvals1==min(diffvals1) | diffvals2==min(diffvals2))

Note: This method will give you back more than 2 values under a couple of conditions:

  • You have a series of 3 or more values that are exactly the same value
  • You have a couple of islands of values that are exactly the same.
Kyrubas
  • 877
  • 8
  • 23
  • I think this worked! `df3<-df %>% group_by(Label, Comments) %>% arrange(Fat) %>% mutate(diffvals= Fat - lag(Fat)) %>% arrange(diffvals) %>% filter(row_number() == 1 | row_number() == 2)` – celow Jan 25 '18 at 22:02