Sum of occurrences of words with multiple conditions

Question

I wish to get the sum of the occurrences of the fruit produced by each company in the previous five years and to construct a new column with all the sums.

For example: in 2016 company_b produced apples; oranges; pears while in the previous 5 years company_b produced (2011: apples; oranges; bananas) and (2014: oranges; pears). By counting the number of fruit produced in the previous five years which correspond to the focal (2016) year we get 4.

In my search for an answer I have only seen the sum of occurrences for numbers like in this post R: calculate the number of occurrences of a specific event in a specified time future . However, I need to count the appearances of all the words for any given company in the previous five years.

Any help would be much appreciated, also any solutions using dplyr is welcome! :)

df <- data.frame(company=c("company_a","company_b","company_b", "company_a","company_b","company_a"), 
             fruit=c("peaches, apples; oranges","apples; oranges; bananas","oranges; pears","bananas; apples; oranges; pears","apples; oranges; pears","bananas; apples; oranges; pears; peaches"),
             year=c("2010","2011","2014","2014", "2016","2018"))    

> df
    company                                    fruit year
1 company_a                 peaches, apples; oranges 2010
2 company_b                 apples; oranges; bananas 2011
3 company_b                           oranges; pears 2014
4 company_a          bananas; apples; oranges; pears 2014
5 company_b                   apples; oranges; pears 2016
6 company_a bananas; apples; oranges; pears; peaches 2018

The resulting column should look like this:

df <-  cbind(df, c("0","0","1","2","4","4") 

company                                    fruit year      sum_occurrences
1 company_a                 peaches, apples; oranges 2010               0
2 company_b                 apples; oranges; bananas 2011               0
3 company_b                           oranges; pears 2014               1
4 company_a          bananas; apples; oranges; pears 2014               2
5 company_b                   apples; oranges; pears 2016               4
6 company_a bananas; apples; oranges; pears; peaches 2018               4

It's not clear to me how the desired output is carried out. Could you describe it better? — nicola, Jan 22 '19 at 15:40
Agreed, more explanation needed. It might be easier to see if the output was sorted by company and then year, but I *think* what OP wants is a *rolling sum over a 5-year window, by company, of terms duplicated terms in `fruit`*. Company A 2010 = 0 because first year. Company A 2014 = 2 because apples and oranges duplicated from 2010 (within 5 years). Company A 2018 = 4 because bananas, apples, oranges, pears duplicated in 2014 (within 5 years)---peaches not considered a duplicate from 2010 because it is >5 years ago. **OP, is this correct?** — Gregor Thomas, Jan 22 '19 at 15:46
Have you considered using separate_rows from tidyr, and then summarizing the results? — Phill, Jan 22 '19 at 16:01
@Gregor, yes that is correct. That is what I am looking for, probably a _rolling sum_. — Amleto, Jan 22 '19 at 16:01
i think that you'd find this a much easier task if youd reorganize your data. — Phi, Jan 22 '19 at 16:12

IceCreamToucan · Accepted Answer · 2019-01-22T16:30:51.567

# clean up column classes
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)

library(data.table)
setDT(df)

# create separate column for vector of fruits, and year + 5 column
df[, fruit2 := strsplit(gsub(' ', '', fruit), ',|;')]
df[, year2 := year + 5]

# Self join so for each row of df, this creates one row for each time another  
# row is within the year range 
df2 <- df[df, on = .(year <= year2, year > year, company = company)
          , .(company, fruit, fruit2, i.fruit2, year = x.year)]

# For each row in the (company, fruit, year) group, check whether 
# the original fruits are  in the matching rows' fruits, and store the result
# as a logical vector. Then sum the list of logical vectors (one for each row).
df3 <- df2[, .(sum_occurrences = do.call(sum, Map(`%in%`, fruit2, i.fruit2)))
           , by = .(company, fruit, year)]

# Add sum_occurrences to original df with join, and make NAs 0
df[df3, on = .(company, fruit, year), sum_occurrences := i.sum_occurrences]
df[is.na(sum_occurrences), sum_occurrences := 0]

#delete temp columns
df[, `:=`(fruit2 = NULL, year2 = NULL)]

Result

df


#      company                                    fruit year sum_occurrences
# 1: company_a                 peaches, apples; oranges 2010               0
# 2: company_b                 apples; oranges; bananas 2011               0
# 3: company_b                           oranges; pears 2014               1
# 4: company_a          bananas; apples; oranges; pears 2014               2
# 5: company_b                   apples; oranges; pears 2016               4
# 6: company_a bananas; apples; oranges; pears; peaches 2018               4

Thanks @IceCreamToucan, I get this error message when running df2: `in Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,` : do you know what it means? — Amleto, Jan 22 '19 at 18:50
Is that the full error? Maybe you need to add `, allow.cartesian = TRUE` insize the brackets (before `]`) — IceCreamToucan, Jan 22 '19 at 18:56
Thank you @IceCreamToucan this worked perfectly. By any chance, would you know how could I get only the fruit that have never been produced before by the company? for example: in 2018 company_a produced for the first time peaches, thus sum_occurrences(2018) = 1 ? — Amleto, Jan 22 '19 at 20:31

Sum of occurrences of words with multiple conditions

1 Answers1

Linked