
I have a dataframe consisting of an ID (the same for each element in a group), two datetimes, and the time interval between them. One of the datetime columns is my relevant time marker. I would now like to get a subset of the dataframe consisting of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.

My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group.

I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time-interval value.

Is there an (easy) way to accomplish this?

ADDITION: Maybe I was unclear with my aggregate() and ddply() notes. I do not necessarily need to aggregate. Since the dataframe is sorted so that the first row of each new group is the row I am looking for, it would suffice to return a subset containing each row whose ID differs from the one before (i.e. the start row of each new group).

Example data:

structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L, 
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"), 
    Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170, 
    1357913412, 1358151763, 1358691675, 1358789411, 1359538400
    ), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430, 
    1357365312, 1357564413, 1358230679, 1357978810, 1358674600, 
    1358853933, 1359531923, 1359568151), class = c("POSIXct", 
    "POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667, 
    3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3, 
    12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End", 
"Interval"), row.names = c(NA, -9L), class = "data.frame")
fr3d-5
  • Please provide a reproducible example with sample data, preferably using `dput`. – juba Oct 18 '13 at 13:33
  • `aggregate(cols_to_aggregate ~ grouping_var, yourdata, head, 1)` might do what you need. – A5C1D2H2I1M1N2O1R2T1 Oct 18 '13 at 13:34
  • @AnandaMahto Ah, nice, I always forget about the formula form of `aggregate`. – juba Oct 18 '13 at 13:40
  • Hi, thank you! Providing a reproducible example is difficult since the data is not meant to be public. I'll see if I can create some example. However, the overall problem seems understood. – fr3d-5 Oct 21 '13 at 13:50

4 Answers


By reproducing the example data frame and testing, I found a way to get the needed result:

  1. Order data by relevant columns (ID, Start)

    ordered_data <- data[order(data$ID, data$Start),]

  2. Find the first row for each new ID

    final <- ordered_data[!duplicated(ordered_data$ID),]
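Put together on a minimal frame, the two steps look like this (a sketch; `grp` and `ts` are made-up column names standing in for ID and Start):

```r
df <- data.frame(grp = c(2L, 1L, 2L, 1L),
                 ts  = as.POSIXct(c("2013-01-03", "2013-01-02",
                                    "2013-01-01", "2013-01-04")))

# 1. Order by group, then by the relevant datetime
ordered_df <- df[order(df$grp, df$ts), ]

# 2. Keep only the first row for each new group
first_rows <- ordered_df[!duplicated(ordered_df$grp), ]

first_rows$ts   # earliest timestamp of each group
```

`duplicated` marks every repeat of a value it has already seen, so negating it keeps exactly the first occurrence of each group in the sorted order.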

fr3d-5

Since you don't provide any data, here is an example using base R with a sample data frame:

df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)

EDIT: As Ananda suggests in his comment, the following call to aggregate is better:

aggregate(.~group, df, FUN=head, 1)

If you prefer to use plyr, you can replace aggregate with ddply:

ddply(df, "group", head, 1)
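To make the behaviour concrete: with df as defined above (group recycled a, b against value 1:8), both calls return one row per group holding the first value of each remaining column. A quick base-R run of the formula form:

```r
df <- data.frame(group = c("a", "b"), value = 1:8)
df <- df[order(df$value), ]   # sort so the first row per group is the one wanted

# head with n = 1 picks the first value of each non-grouping column per group
res <- aggregate(. ~ group, df, FUN = head, 1)
res
#   group value
# 1     a     1
# 2     b     2
```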
juba
  • Thank you! I think the `head` function was what I was looking for. However, applying either the aggregate or the ddply function crashes R. – fr3d-5 Oct 21 '13 at 13:52
  • @user2035177 Which is why we are asking you to post a reproducible example – at best we can only guess what the issue is. Take 10 rows of your data, disguise the ID, and add some noise to the datetimes. Voila, a public dataset. – Blue Magister Oct 21 '13 at 15:20
  • Provided. As well as a way of solving my issue. Thanks for the help. Hope this will be helpful for others too. – fr3d-5 Oct 21 '13 at 16:20
  • As a small warning, long-running operations on millions of rows will take some time and freeze the RGui. The RGui will still respond once in a while, but will not accept input in the R Console. – Zimano Jan 16 '20 at 14:29

Using `ffirst` from the collapse package:

library(collapse)
ffirst(df, g = df$group)

data

df <- data.frame(group=c("a", "b"), value=1:8)
akrun

This could also be achieved with dplyr, using group_by and the slice family of functions:

data %>%
        group_by(ID) %>%
        slice_head(n = 1)
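A self-contained run on the small sample frame from the earlier answers (assumes dplyr >= 1.0, where slice_head() was introduced; remember to arrange first if the data is not already sorted within groups):

```r
library(dplyr)

df <- data.frame(group = c("a", "b"), value = 1:8)

first_per_group <- df %>%
  arrange(value) %>%       # sort so the first row per group is the earliest
  group_by(group) %>%
  slice_head(n = 1) %>%    # keep the first row of each group, untouched
  ungroup()
```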
Serkan