
This is my first post, so hopefully I explain what I need to do properly. I am still quite new to R and I may have read posts that answer this, but I just can't for the life of me understand what they mean. So apologies in advance if this has already been answered.

I have a very large data set of GPS locations from radiocollars and there are inconsistent numbers of locations for each day. I want to go through the dataset and select a single data point for each day based on the accuracy level of the GPS signal.

So it essentially looks like this.

Accuracy    Month    Day    Easting    Northing    Etc
   5          6       1     #######    ########     #
   3.2        6       1     #######    ########     #
   3.8        6       1     #######    ########     #
   1.6        6       2     #######    ########     #
   4          6       3     #######    ########     #
   3.2        6       3     #######    ########     #

And I want to pull out the most accurate point for each day (the lowest accuracy measure) while keeping the rest of the associated data.

Currently I have been using the tapply function:

datasub1<-subset(data,MONTH==6)
tapply(datasub1$accuracy, datasub1$day, min)

Using this method I can successfully retrieve the minimum values, one for each day. However, I cannot bring along the associated coordinates, timing, and all the other important information, and as the data set is nearly 300,000 rows, I really can't do it by hand.

So essentially, I need to get the same results as the tapply, but instead of individual values, I need the entire row in which each value is found.

Thanks in advance to anyone that could lend a hand. If you need any more information, let me know, I'll try my best to get it to you.

Vincent Zoonekynd
HeidelbergSlide

3 Answers


You can use ddply: it cuts a data.frame into pieces (one per day) and applies a function to each piece.

# Sample data
n <- 100
d <- data.frame(
  Accuracy = round(runif(n, 0, 5), 1),
  Month    = sample(1:2, n, replace=TRUE),
  Day      = sample(1:5, n, replace=TRUE),
  Easting  = rnorm(n),
  Northing = rnorm(n),
  Etc      = rnorm(n)
)

# Extract the row with the minimum Accuracy for each (Month, Day)
# (In case of ties, you only get the first row)
library(plyr)
ddply( 
  d, 
  c("Month", "Day"), 
  function (u) u[ which.min(u$Accuracy), ] 
)
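To see what the function passed to ddply does to each piece, here is a minimal base R sketch on a single piece. The data frame `piece` is hypothetical: it mirrors the three Day 1 rows from the question's table, with made-up coordinates.

```r
# Hypothetical piece: the rows for Month 6, Day 1 from the question,
# with invented Easting/Northing values.
piece <- data.frame(
  Accuracy = c(5, 3.2, 3.8),
  Month    = 6,
  Day      = 1,
  Easting  = c(101, 102, 103),
  Northing = c(201, 202, 203)
)

# which.min() returns the position of the smallest Accuracy
# (the first such position if there are ties).
idx <- which.min(piece$Accuracy)

# Indexing with [idx, ] keeps every column of that row.
best <- piece[idx, ]
print(best)
```

ddply runs this same step once per Month/Day combination and stacks the resulting one-row data frames.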
Vincent Zoonekynd
  • Excellent! Thanks so much, I stumbled across ddply in my searches quite a few times, but I didn't know how to apply it to my own stuff. Like I said, I'm new to R and it's definitely not my strong suit. I'm not entirely sure what the code after the accuracy, month, and day is for. I was getting some wacky numbers with it, and when I pulled it out it all came out how I wanted it. But it's all good now, saved me oodles of time. Thanks again. – HeidelbergSlide Jan 19 '12 at 02:06
  • @mathematical.coffee: I have replaced my max by a min, to match the original question. – Vincent Zoonekynd Jan 19 '12 at 02:20
  • @HeidelbergSlide, if this answer is the one that worked for you, then click on the little green tick on the top-left of the answer -- it lets future users with the same problem as you know how you fixed it so that they can too. – mathematical.coffee Jan 19 '12 at 03:30

Here is one base R solution using the split-apply paradigm, which formed the basis for the plyr functions, at least in the beginning:

lapply(
  split(dat, list(dat$Month, dat$Day)),
  function(d) d[which.min(d$Accuracy), ]
)
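The lapply call returns a list with one element per Month/Day combination. If a single data frame is wanted, one option (a sketch, not part of the original answer) is to bind the pieces back together with do.call(rbind, ...). The sample `dat` below is invented, and `drop = TRUE` is an added assumption so that combinations with no rows are skipped.

```r
# Invented sample data shaped like the question's table.
dat <- data.frame(
  Accuracy = c(5, 3.2, 3.8, 1.6, 4, 3.2),
  Month    = 6,
  Day      = c(1, 1, 1, 2, 3, 3),
  Easting  = 1:6,
  Northing = 11:16
)

# Split into one data frame per Month/Day, dropping empty combinations.
pieces <- split(dat, list(dat$Month, dat$Day), drop = TRUE)

# Keep the row with the smallest Accuracy in each piece.
best_rows <- lapply(pieces, function(d) d[which.min(d$Accuracy), ])

# Stack the one-row data frames into a single data frame.
result <- do.call(rbind, best_rows)
rownames(result) <- NULL
print(result)
```

With nearly 300,000 rows and one piece per day this should still be manageable, though rbind on a very large number of pieces can be slow.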
IRTFM
  • I was curious how someone would approach this in base. I didn't have an answer myself. Nice solution. Your solution seemed to generate a list structure (for me) and won't give the nice dataframe HeidelbergSlide seems to be after. Modifying your response and using sapply and then wrapping that with a t() seems to give a data frame that more closely represents the poster's desired outcome. – Tyler Rinker Jan 19 '12 at 04:54

So you don't really want to aggregate at all. All you need to do is find the minimum for each day and select the rows that match it.

mins <- ave(datasub1$accuracy, datasub1$day, FUN = min)
datasub1[ datasub1$accuracy == mins, ]

If you need day by month or year or whatever, then just add them as further grouping variables after the first argument of ave. Here's an alternate syntax using with; the row selection is the same as before.

mins <- with( datasub1, ave(accuracy, day, month, FUN = min) )
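As a rough end-to-end sketch of the ave approach, grouped by month and day: the data frame below is invented for illustration and uses the lowercase column names from the question's code. Unlike which.min, the == comparison keeps every row tied for the minimum, so the last step (an addition, not from the answer) keeps only the first row per month and day.

```r
# Invented sample data; day 1 has two rows tied for the minimum accuracy.
datasub1 <- data.frame(
  accuracy = c(5, 3.2, 3.2, 1.6, 4, 3.2),
  month    = 6,
  day      = c(1, 1, 1, 2, 3, 3),
  easting  = 1:6
)

# ave() returns a vector as long as the data, holding each row's group minimum.
mins <- with(datasub1, ave(accuracy, day, month, FUN = min))

# Keep whole rows whose accuracy equals their group minimum (ties included).
best <- datasub1[datasub1$accuracy == mins, ]
print(best)

# Optional: keep only the first row for each month/day combination.
best_one <- best[!duplicated(best[c("month", "day")]), ]
print(best_one)
```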
John
  • I don't think this helps for "...I need the entire row that that point is found in." It will only return the accuracy and day columns. – WhiteViking Sep 01 '15 at 23:04
  • fixed now... no edit history on this so either I missed that or it was added later – John Sep 02 '15 at 05:07