I would like to subset my tbl based on the minimum value of a variable.

I found a SO post already here using data.table. Is there a way using dplyr?

> glimpse(x)
Observations: 3,074,921
Variables: 9
$ sessionId <chr> "1468614023881.kvz0h9ofxbt9", "1469063434066.e9h65wdygb9", "1469240810386.2k47r07tx1or", "146933076...
$ dateHour    <chr> "2016080106", "2016080118", "2016080119", "2016080120", "2016080108", "2016080106", "2016080117", "...
$ minute      <ord> 25, 10, 30, 38, 32, 12, 42, 32, 42, 39, 32, 20, 0, 4, 39, 46, 54, 32, 46, 46, 33, 53, 51, 2, 22, 36...
$ userType    <chr> "New Visitor", "New Visitor", "New Visitor", "New Visitor", "New Visitor", "New Visitor", "Returnin...
$ region      <chr> "Virginia", "Washington", "Chihuahua", "Missouri", "Nevada", "Minnesota", "Oklahoma", "(not set)", ...
$ metro       <chr> "Roanoke-Lynchburg VA", "Seattle-Tacoma WA", "(not set)", "Joplin MO-Pittsburg KS", "Reno NV", "Min...
$ city        <chr> "Roanoke", "Camano Island", "Ciudad Juarez", "Joplin", "Reno", "Owatonna", "Edmond", "Port-au-Princ...
$ sessions    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ dhm         <chr> "201608010625", "201608011810", "201608011930", "201608012038", "201608010832", "201608010612", "20...

The dhm variable is the concatenation of the dateHour and minute columns. My data has some duplicate session IDs; for each duplicated sessionId, I would like to keep only the earliest row, i.e. the one with min(dhm).

Doug Fir

1 Answer

Group the data by session ID and arrange by dhm, then keep only the first row of each group:

dat %>% group_by(sessionId) %>% arrange(dhm) %>% filter(row_number() == 1)

Or, as pointed out in the comments:

dat %>% group_by(sessionId) %>% filter(which.min(dhm) == row_number())
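A variant of the which.min idea can be sketched with slice(), which takes row positions directly, so no logical comparison against row_number() is needed. This is only a sketch under assumptions: the toy data frame below is hypothetical (it just mirrors the sessionId, dhm and city columns from the question), and dhm is converted with as.numeric() because it is stored as character.

```r
library(dplyr)

# Hypothetical toy data mirroring a few of the question's columns
x <- data.frame(
  sessionId = c("a", "a", "b", "c", "c"),
  dhm       = c("201608010625", "201608010612", "201608011810",
                "201608012038", "201608011930"),
  city      = c("Roanoke", "Reno", "Joplin", "Edmond", "Owatonna"),
  stringsAsFactors = FALSE
)

earliest <- x %>%
  group_by(sessionId) %>%
  # which.min() returns the position of the smallest dhm within each group
  slice(which.min(as.numeric(dhm))) %>%
  ungroup()

print(earliest)
```

Each sessionId should appear exactly once in the result, paired with its smallest dhm.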
Wietze314
  • I’m pretty sure it’s more efficient to use `which.min` and subset by that, rather than to `arrange` the whole group (O(*n*) vs O(*n* log *n*)). – Konrad Rudolph Jan 05 '17 at 12:18
  • And if you are going to arrange, it is probably better to arrange the whole data set once first, rather than rearranging each group for each session. – David Arenburg Jan 05 '17 at 12:22
  • Cool, thanks for the advice! – Wietze314 Jan 05 '17 at 12:25
  • Thanks for the answer, comments and link to similar question. I tried the second option and received the error "Error: filter condition does not evaluate to a logical vector. " So then I tried the option using which.min(sessionId) on the link under "This question already has an answer here:". That did the trick. – Doug Fir Jan 05 '17 at 13:03
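
The arrange-first suggestion from the comments could look roughly like the sketch below: sort the whole table once by dhm, then keep the first row seen for each sessionId. The toy data frame is hypothetical, and the use of distinct() with .keep_all = TRUE is an assumption about how one might implement the idea, not something stated in the thread. Sorting dhm as character works here only because every value is a fixed-width string of digits.

```r
library(dplyr)

# Hypothetical toy data mirroring a few of the question's columns
x <- data.frame(
  sessionId = c("a", "a", "b", "c", "c"),
  dhm       = c("201608010625", "201608010612", "201608011810",
                "201608012038", "201608011930"),
  city      = c("Roanoke", "Reno", "Joplin", "Edmond", "Owatonna"),
  stringsAsFactors = FALSE
)

earliest <- x %>%
  arrange(dhm) %>%                       # one sort of the full table
  distinct(sessionId, .keep_all = TRUE)  # keep the first row per sessionId

print(earliest)
```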