I have a small programming problem I cannot seem to figure out. How can I, in an elegant way, count the number of consecutive numbers in a sequence, starting from a chosen value, per group in R?

For example, we have a data frame with names and numbers, and would like to reduce it to one entry per name, with a second column holding the number of consecutive entries for that name.

names <- c(rep("bob",5), rep("henry",5), rep("maria",5))
goals <- c(1,2,3,5,4, 4,3,4,5,2, 1,2,4,6,5)
input.df <- data.frame(names, goals)

Starting from 1, the output data frame would look like the one below: "bob" has a 3, since his goals run sequentially from 1 to 3; henry has 0, because he has neither a 1 nor any ordered entries; and maria has 2, because her entries run from 1 to 2.

names <- c("bob", "henry", "maria")
runs <- c(3, 0, 2)
output.df.from.1 <- data.frame(names, runs)

Starting from 3, both bob and maria would have a 0, but henry would now have a 3, since he has 3, 4, 5.

names <- c("bob", "henry", "maria")
runs <- c(0, 3, 0)
output.df.from.3 <- data.frame(names, runs)

I am certain there must be a simple solution to this, but I have not been able to find one. I might be searching for the wrong things.

Does anyone have a suggestion?

  • Is there a particular reason your `goals` column is a string instead of a number? It seems as if you want to count them as numbers, but you're explicitly casting them to strings by the way you're making the frame. Perhaps you should just do `input.df <- data.frame(names, goals)` instead of the unnecessarily complex `as.data.frame(cbind(..))` method (which is rarely necessary/useful)? – r2evans Oct 22 '21 at 19:30
  • `henry` has a 1, even if the entries are out of order. Your rules are a little unclear, are you saying for each name that the first goal must be `1` and you only count those that increment by 1 for each row? – r2evans Oct 22 '21 at 19:36
  • Hi r2evans, sorry for being unclear about the rules. Yes, this is exactly what I mean. And there is no specific reason why it should be a string instead of a number; I'll edit the question. – user547928359 Oct 22 '21 at 20:36
  • Do you want a list of data frames as the final output? Why do you check consecutive values of goals only for 1 and 3, and not for other numbers like 2, 4, 5, 6? – Ronak Shah Oct 23 '21 at 01:55

1 Answer

Here is a possible solution to your question. The idea is to 1) find the (possibly multiple) groups of consecutive numbers for each person, then 2) given a value, find the length of the consecutive group starting from that value.

I changed your example data a bit to cover the case where a person has several groups of consecutive numbers (e.g. bob now has the numbers 1,2,3,5,4, 7,8,9, and his consecutive groups are 1,2,3 and 7,8,9).

  1. Find the consecutive numbers for each person. First group by names; then, within each group, compare each goal with the previous and the next goal. A goal belongs to a consecutive group if either previous_goal - current_goal = -1 or next_goal - current_goal = 1. Note that I use both the previous and the next value so that every member of a consecutive group is retained, including its first and last element.
library(tidyverse)
names <- c(rep("bob",8), rep("henry",5), rep("maria",5))
goals <- c(1,2,3,5,4, 7,8,9, 4,3,4,5,2, 1,2,4,6,5)
df1 <- data.frame(names, goals) 

df2 <- df1 %>% 
  group_by(names) %>%  
  mutate(goals_lag = lag(goals) - goals) %>% 
  mutate(goals_lead = lead(goals) - goals) %>% 
  filter(goals_lag == -1 | goals_lead == 1) %>% 
  select(-goals_lag, -goals_lead)
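
To see which values the filter above keeps, here is a minimal base R sketch of the same idea applied to bob's goals only. The names `bob_goals`, `run_id`, and `runs` are illustrative assumptions and not part of the answer's code; the sketch labels runs with `diff()` and `cumsum()` rather than `lag()`/`lead()`.

```r
# bob's goals from the modified example data
bob_goals <- c(1, 2, 3, 5, 4, 7, 8, 9)

# A new run starts wherever the step from the previous value is not exactly +1
run_id <- cumsum(c(TRUE, diff(bob_goals) != 1))

# Split the goals by run; runs of length 1 are the rows the filter drops
runs <- split(bob_goals, run_id)
runs
```

This gives four runs: 1,2,3, then 5, then 4, then 7,8,9. The two single-element runs (5 and 4) are exactly the rows that do not survive into `df2`.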
  2. Write a function to calculate the length of the consecutive group starting from a given value. Bob has two consecutive groups, 1,2,3 and 7,8,9. If the given value is 1, the length should be 3, not 6. Therefore we need to know where each consecutive group starts (the starting index of group 7,8,9 is 4). After locating the position of the given value (for the value 1, the index is 1), we subtract it from the start position of the next group (4-1=3 in this case) to get the length. If the value lies in the last group, the end of the vector plays the role of the next group start.
cons_len <- function(df, name, start_val){

  # take goals as a vector
  vec <- (df %>% filter(names == name))$goals
  # find the starting positions of the second and later groups
  vec_stops <- which( (vec - c(vec[1] - 1, vec[-length(vec)])) != 1)
  # find the index of the given value
  vec_start <- which(vec == start_val)

  # if the value is not found, return 0
  if (length(vec_start) == 0) {
    return(0)

  # if there is only one group of consecutive numbers
  } else if (length(vec_stops) == 0) {
    return(length(vec) - vec_start + 1)

  } else {

    # if there are multiple groups of consecutive numbers:
    # the group ends just before the next group start, or at the end of
    # the vector when the given value lies in the last group
    next_start <- c(vec_stops[vec_stops > vec_start], length(vec) + 1)[1]
    len <- next_start - vec_start
    # a group of length 1 is not a consecutive run, so report 0
    return(ifelse(len == 1, 0, len))
  }
}

# apply to each name
sapply(unique(df1$names), function(name) cons_len(df2, name, 1))
# bob henry maria 
# 3     0     2 

sapply(unique(df1$names), function(name) cons_len(df2, name, 3))
# bob henry maria 
# 0     3     0 
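
As a more compact alternative, the following base R sketch skips the pre-filtering step and counts directly from the first occurrence of the start value. It is only a sketch under stated assumptions: `run_from` and `demo_df` are hypothetical names introduced here, only the first occurrence of the start value is considered, and a lone match counts as 0, as in the outputs above.

```r
# Hypothetical helper: length of the run start_val, start_val + 1, ...
# beginning at the first occurrence of start_val in x.
run_from <- function(x, start_val) {
  i <- match(start_val, x)              # first position of start_val, NA if absent
  if (is.na(i)) return(0)
  tail_x <- x[i:length(x)]              # values from the start value onwards
  expected <- start_val + seq_along(tail_x) - 1
  mismatch <- which(tail_x != expected) # positions that break the +1 pattern
  len <- if (length(mismatch) == 0) length(tail_x) else mismatch[1] - 1
  if (len == 1) 0 else len              # a lone match is not a run
}

# Same modified example data as above
demo_df <- data.frame(
  names = c(rep("bob", 8), rep("henry", 5), rep("maria", 5)),
  goals = c(1,2,3,5,4, 7,8,9, 4,3,4,5,2, 1,2,4,6,5)
)

from_1 <- sapply(split(demo_df$goals, demo_df$names), run_from, start_val = 1)
from_3 <- sapply(split(demo_df$goals, demo_df$names), run_from, start_val = 3)
from_1
from_3
```

On this data, `from_1` is 3, 0, 2 and `from_3` is 0, 3, 0 for bob, henry, and maria, matching the results above. Because it works on the raw `goals` vector, it needs neither `df2` nor the tidyverse.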
Xiang