Subset rows for each group based on a character in a column and order of occurrence in a data frame

Question

I have a data similar to this.

B <- data.frame(State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)), 
  Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester", "Expenses",  
  "Expenses", "In the Bimester", "In the Bimester"), 3), Value = runif(24))

You can see that Account has 4 occurrences of the element "In the Bimester", two "chunks" of two elements for each state, "Expenses" in between them.

The order here matters because the first chunk is not referring to the same thing as the second chunk.

My data is actually more complex, It has a 4th variable, indicating what each row of Account means. The number of its elements for each Account element (factor per se) can change. For example, In some state, the first "chunk" of "In the Bimester" can have 6 rows and the second, 7; but, I cannot differentiate by this 4th variable.

Desired: I'd like to subset my data, spliting those two "In the Bimester" by each state, subsetting only the rows of the first "chunks" by each state or the second "chunks".

I have a solution using data.table package, but I'm finding it kind of poor. any thoughts?

library(data.table)
B <- as.data.table(B)
B <- B[, .(Account, Value, index = 1:.N), by = .(State)]
x <- B[Account == "Expenses", .(min_ind = min(index)), by = .(State)]
B <- merge(B, x, by = "State")
B <- B[index < min_ind & Account == "In the Bimester", .(Value), by = .(State)]

M-- · Accepted Answer · 2017-10-04T21:24:21.410

You can use dplyr package:

library(dplyr)
B %>% mutate(helper = data.table::rleid(Account)) %>% 
      filter(Account == "In the Bimester") %>% 
      group_by(State) %>% filter(helper == min(helper)) %>% select(-helper)

# # A tibble: 6 x 3
# # Groups:   State [3]
#        State         Account      Value
#       <fctr>          <fctr>      <dbl>
# 1    Arizona In the Bimester 0.17730148
# 2    Arizona In the Bimester 0.05695585
# 3 California In the Bimester 0.29089678
# 4 California In the Bimester 0.86952723
# 5      Texas In the Bimester 0.54076144
# 6      Texas In the Bimester 0.59168138

If instead of min you use max you'll get the last occurrences of "In the Bimester" for each State. You can also exclude Account column by changing the last pipe to select(-helper,-Account).

p.s. If you don't want to use rleid from data.table and just use dplyr functions take a look at this thread.

Subset rows for each group based on a character in a column and order of occurrence in a data frame

1 Answers1