
I have a data frame that looks like this toy data frame:

df <- data.frame(company=c("company_a","company_b","company_b", "company_a","company_b","company_a"), 
         fruit=c("peaches, apples; oranges","apples; oranges; bananas","oranges; pears","bananas; apples; oranges; pears","apples; oranges; pears","bananas; apples; oranges; pears; peaches"),
         year=c("2010","2011","2014","2014", "2016","2018"))    


> df
    company                                    fruit year
1 company_a                 peaches; apples; oranges 2010
2 company_b                 apples; oranges; bananas 2011
3 company_b                           oranges; pears 2014
4 company_a          bananas; apples; oranges; pears 2014
5 company_b                   apples; oranges; pears 2016
6 company_a bananas; apples; oranges; pears; peaches 2018

Desired Outcome

I would like a column (new_occurrences) with the number of fruits in each row that have not appeared for the same company in the previous five years.

For example, in row 4 (company_a, 2014), bananas and pears did not appear for company_a in the previous 5 years, so new_occurrences = 2.

That will look like this:

> df
    company                                    fruit year new_occurrences 
1 company_a                 peaches; apples; oranges 2010 3
2 company_b                 apples; oranges; bananas 2011 3
3 company_b                           oranges; pears 2014 1
4 company_a          bananas; apples; oranges; pears 2014 2
5 company_b                   apples; oranges; pears 2016 0
6 company_a bananas; apples; oranges; pears; peaches 2018 1
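
As a minimal sketch of the rule for row 4 only (not a full solution; the variable names `current`, `previous` and `new_fruit` are illustrative and not part of the data):

```r
# Fruits listed for company_a in 2014 (row 4)
current <- c("bananas", "apples", "oranges", "pears")

# Fruits listed for company_a in the previous five years (only row 1, year 2010)
previous <- c("peaches", "apples", "oranges")

# Fruits in the current row that are not in the previous rows
new_fruit <- setdiff(current, previous)
new_fruit          # "bananas" "pears"
length(new_fruit)  # 2
```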

Attempt

I tried adapting the answer from this question. To do so, I created a function that is the opposite of '%in%' and used it when building df3.

'%!in%' <- function(x,y)!('%in%'(x,y))

# clean up column classes
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)

library(data.table)
setDT(df)

# create separate column for vector of fruits, and year + 5 column
df[, fruit2 := strsplit(gsub(' ', '', fruit), ',|;')]
df[, year2 := year + 5]

# Self join so for each row of df, this creates one row for each time another  
# row is within the year range 
df2 <- df[df, on = .(year <= year2, year > year, company = company)
      , .(company, fruit, fruit2, i.fruit2, year = x.year)]

# create a function which is the opposite of '%in%'
'%!in%' <- function(x,y)!('%in%'(x,y))

# For each row in the (company, fruit, year) group, check whether 
# the original fruits are  in the matching rows' fruits, and store the  result
# as a logical vector. Then sum the list of logical vectors (one for each row).
df3 <- df2[, .(new_occurrences = do.call(sum, Map(`%!in%`, fruit2, i.fruit2)))
       , by = .(company, fruit, year)]

# Add sum_occurrences to original df with join, and make NAs 0
df[df3, on = .(company, fruit, year), new_occurrences :=  i.new_occurrences]
df[is.na(new_occurrences), new_occurrences := 0]

#delete temp columns
df[, `:=`(fruit2 = NULL, year2 = NULL)]

Unfortunately this attempt does not give me my desired outcome.

Any help would be much appreciated; solutions with dplyr are also welcome! :)

Amleto
    Is the first comma in first row after `peaches` a typo? Should it have been a semicolon? – Sotos Feb 18 '19 at 12:30

2 Answers


A tidyverse attempt:

library(tidyverse)

years_window <- 5

df %>%
  separate_rows(fruit, sep = "; |, ") %>%
  mutate(tmp = 1, 
         year = as.integer(as.character(year))) %>%
  complete(company = unique(.$company),
           year = (min(year) - years_window):max(year), 
           fruit = unique(.$fruit)) %>%
  arrange(year) %>%
  group_by(company, fruit) %>%
  mutate(check = zoo::rollapply(tmp, 
                                FUN = function(x) sum(is.na(x)),
                                width = list(-(1:years_window)),
                                align = 'right',
                                fill = NA,
                                partial = TRUE)) %>% 
  group_by(company, year) %>% 
  mutate(new_occurrences = sum(check == years_window & !is.na(tmp))) %>%
  filter(!is.na(tmp)) %>%
  distinct(company, year, new_occurrences) %>% 
  arrange(year) %>%
  left_join(df %>% 
              mutate(year = as.integer(as.character(year))),
            by = c("company", "year")) %>%
  select(company, fruit, year, new_occurrences)

Output:

# A tibble: 6 x 4
# Groups:   company, year [6]
  company   fruit                                     year new_occurrences
  <fct>     <fct>                                    <int>           <int>
1 company_a peaches, apples; oranges                  2010               3
2 company_b apples; oranges; bananas                  2011               3
3 company_a bananas; apples; oranges; pears           2014               2
4 company_b oranges; pears                            2014               1
5 company_b apples; oranges; pears                    2016               0
6 company_a bananas; apples; oranges; pears; peaches  2018               1
arg0naut91
    The order in my dataset is not alphabetical. Could you show me how to join with the original df, please? – Amleto Feb 18 '19 at 15:17
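
One possible way to keep the original row order, sketched in base R under stated assumptions: `row_id` is a hypothetical helper column added before any processing, and `res` stands in for the result printed in the output above (its values are copied from that output). This is an illustration, not part of the answer's pipeline.

```r
# Original data (only the key columns), with a helper column recording row order
df <- data.frame(
  company = c("company_a", "company_b", "company_b",
              "company_a", "company_b", "company_a"),
  year = c(2010, 2011, 2014, 2014, 2016, 2018),
  stringsAsFactors = FALSE
)
df$row_id <- seq_len(nrow(df))  # hypothetical helper column

# Stand-in for the computed result, in the row order printed in the output
res <- data.frame(
  company = c("company_a", "company_b", "company_a",
              "company_b", "company_b", "company_a"),
  year = c(2010, 2011, 2014, 2014, 2016, 2018),
  new_occurrences = c(3L, 3L, 2L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Join on the keys, then restore the original order using the helper column
joined <- merge(df, res, by = c("company", "year"))
joined <- joined[order(joined$row_id), ]
joined$new_occurrences  # 3 3 1 2 0 1
```

The same idea should carry over to dplyr by adding the helper column before the pipeline and arranging by it at the end.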

Assuming the input shown reproducibly in the Note at the end, define two functions to convert a semicolon-separated string to a vector and back again. Then, for each row, determine the fruit listed by the same company in the previous 5 years and compute which fruit in the current row are not among them. In a second transform, compute the number of new fruit. No packages are used.

char2vec <- function(x) scan(text = x, what = "", sep = ";", strip.white = TRUE, 
  quiet = TRUE)
vec2char <- function(x) paste(x, collapse = "; ")

df2 <- transform(df, new = sapply(1:nrow(df), function(i) {
  year0 <- df$year[i]; company0 <- df$company[i]; fruit0 <- df$fruit[i]
  prev_fruit <- char2vec(subset(df, 
    year < year0 & year >= year0 - 5 & company == company0)$fruit)
  vec2char(Filter(function(x) !x %in% prev_fruit, char2vec(fruit0)))
}), stringsAsFactors = FALSE)
transform(df2, num_new = lengths(lapply(new, char2vec)))

giving:

    company                                    fruit year                      new num_new
1 company_a                 peaches; apples; oranges 2010 peaches; apples; oranges       3
2 company_b                 apples; oranges; bananas 2011 apples; oranges; bananas       3
3 company_b                           oranges; pears 2014                    pears       1
4 company_a          bananas; apples; oranges; pears 2014           bananas; pears       2
5 company_b                   apples; oranges; pears 2016                                0
6 company_a bananas; apples; oranges; pears; peaches 2018                  peaches       1
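
To see what the two helper functions do on their own, here is a small sketch that repeats their definitions from above and applies them to single strings (the variable `v` is illustrative only):

```r
# Helper definitions repeated from the answer so the block runs on its own
char2vec <- function(x) scan(text = x, what = "", sep = ";", strip.white = TRUE,
  quiet = TRUE)
vec2char <- function(x) paste(x, collapse = "; ")

# Split a semicolon-separated string into a character vector
v <- char2vec("bananas; apples; oranges; pears")
v            # "bananas" "apples" "oranges" "pears"

# Collapse the vector back into one string
vec2char(v)  # "bananas; apples; oranges; pears"

# An empty string yields a zero-length vector, which is why row 5 gets num_new = 0
length(char2vec(""))  # 0
```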

Note

This is taken from the question, with one comma changed to a semicolon.

df <- data.frame(company=c("company_a","company_b","company_b", 
     "company_a","company_b","company_a"), 
   fruit=c("peaches; apples; oranges","apples; oranges; bananas",
     "oranges; pears", "bananas; apples; oranges; pears", 
     "apples; oranges; pears", "bananas; apples; oranges; pears; peaches"),
   year = c("2010","2011","2014","2014", "2016","2018"))    
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)
G. Grothendieck