6

I am running into an issue with my data where I want to take the first observed ob score score for each individual id and subtract that from that last observed score.

The problem with asking for the first observation minus the last observation is that sometimes the first observation data is missing.

Is there anyway to ask for the first observed score for each individual, thus skipping any missing data?

I built the below df to illustrate my problem.

help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
                   ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

   id ob score
1   5  1    NA
2   5  2     2
3   5  3     3
4   5  4     4
5   5  5     3
6  12  1     7
7  12  2     3
8  12  3     4
9  17  1     3
10 17  2     4
11 20  1    NA
12 20  2     1
13 20  3     4

And what I am hoping to run is code that will give me...

   id ob score  es
1   5  1    NA  -1
2   5  2     2  -1
3   5  3     3  -1
4   5  4     4  -1
5   5  5     3  -1
6  12  1     7   3
7  12  2     3   3
8  12  3     4   3
9  17  1     3  -1
10 17  2     4  -1
11 20  1    NA  -3
12 20  2     1  -3
13 20  3     4  -3

I am attempting to work out of dplyr and I understand the use of the 'group_by' command, however, not sure how to 'select' only first observed scores and then mutate to create es.

Arun
  • 116,683
  • 26
  • 284
  • 387
b222
  • 966
  • 1
  • 9
  • 19

3 Answers3

6

I would use first() and last() (both dplyr function) and na.omit() (from the default stats package.

First, I would make sure your score column was a numberic column with proper NA values (not strings as in your example)

help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
       ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
       score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

then you can do

library(dplyr)
help %>% group_by(id) %>% arrange(ob) %>% 
    mutate(es=first(na.omit(score)-last(na.omit(score))))
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • When I run this code on my actual data, I get an error --> "Error: Don't know how to generate default for object of class numeric". The variables are numeric, and there are a fair number of NAs, e.g. some id's have nothing but NA and others do not. Any thoughts? – b222 Jun 11 '15 at 23:05
  • If you have nothing but NA values, what do you want to return? It would have been nice if your sample data included this scenario along with the desired output. – MrFlick Jun 11 '15 at 23:12
  • Agreed. So the problem rests in there are 3 different variables for score and each id has a score for one of the three variables. I assumed that I could just run the code for each one, but if the id in the group_by does not have any data for the score, then the error message pops up... likely because the na.omit takes out all data and there is nothing to subtract. – b222 Jun 11 '15 at 23:28
  • If the whole group only contains `NA`s, you can tell `first()` or `last()` how to handle that with the `default=` argument, like so: `es = first(na.omit(score), default=NA)` – Andrew Oct 01 '15 at 15:30
  • @MrFlick - This code no longer works and I stumped trying to figure it out. I'm guessing it is related to an update in dplyr (now on version 0.5.0). I receive the error "Error: Unsupported vector type language" when I run the above code. Any ideas what is going on? The structure of the data is all integer or numeric. – b222 Oct 26 '16 at 20:09
  • @bpace dplyr tries to do some fancy things with functions sometimes which seems to break nesting. If you use this `mutate(es=dplyr::first(na.omit(score))-dplyr::last(na.omit(score)))` I think it will still work. Using `dplyr::first()` rather than `first()` seems to trick dplyr into using the non-optimized path but at least it still works. – MrFlick Oct 26 '16 at 20:47
1
library(dplyr)

temp <- help %>% group_by(id) %>% 
     arrange(ob) %>%
     filter(!is.na(score)) %>% 
     mutate(es = first(score) - last(score)) %>%
     select(id, es) %>%
     distinct()

help %>% left_join(temp)
Yifei
  • 137
  • 1
  • 7
0

This solution is a little verbose, only b/c it relies on a couple of helper functions FIRST and LAST:

# The position (indicator) of the first value that evaluates to TRUE.
LAST  <-  function (x, none = NA) {
    out <- FIRST(reverse(x), none = none)
    if (identical(none, out)) {
        return(none)
    }
    else {
        return(length(x) - out + 1)
    }
}
# The position (indicator) of the last value that evaluates to TRUE.
FIRST  <-  function (x, none = NA) 
{
    x[is.na(x)] <- FALSE
    if (any(x)) 
        return(which.max(x))
    else return(none)
}

# returns the difference between the first and last non-missing values
diff2  <-  function(x)
    x[LAST(!is.na(x))] - x[FIRST(!is.na(x))]


library(dplyr)
help %>% 
    group_by(id) %>% 
    arrange(ob) %>% 
        summarise(diff = diff2(score))
Jthorpe
  • 9,756
  • 2
  • 49
  • 64