How can I find the most common sequences in my data using R?

Question

I'm trying to figure out how I can use the rollapply function (from the Zoo package) to find sequences of most common strings within a dataset, but I also need to do group certain variables (e.g. date, row, etc.)

Before I go any further, it's worth noting that this query builds on a question that I previously posted here : How can I find most common sequences (of strings) in my data using Tableau?

The solution offered there works really well, but I now want to apply it to a different dataset which provides some new challenges! Here's an example of the data that I'm working with in this new dataset:

structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith", 
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show", 
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death", 
"Bollywood: The World's Biggest Film Industry", "One Hot Summer", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day", 
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer", 
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock", 
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock", 
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty", 
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States", 
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends", 
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama", 
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport", 
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy", 
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary", 
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport", 
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi", 
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary", 
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama", 
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary", 
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured", 
"Featured", "Featured", "Featured", "This Weekend's Football", 
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured", 
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10", 
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured", 
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", 
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux", 
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1", 
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", 
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", 
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", 
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", 
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2", 
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2", 
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5", 
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", 
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA, 
-56L), class = "data.frame")

Apologies but I'm not quite sure about best practice for sharing data. Hope the above works. It should look something like this:

   Title            Programme_Genre     Programme_Category  date         column row
1   Dragons Den     Entertainment       Featured            13/08/2018      1   1
2  One Hot Summer   Documentary         Featured            13/08/2018      2   1
3  Keeping Faith    Drama               Featured            13/08/2018      3   1
4  Cuckoo           New Series Comedy   Featured            13/08/2018      4   1
5  Match of the Day Sport               This Weekends...    13/08/2018      1   2
6  Sportscene       Sport               This Weekends...    13/08/2018      2   2

What I want to do is to use the rollapply function similar to how it was suggested in my previous question (see link above) but only on looking for sequences that appear on the same date and across a certain range of columns. For example, I want to know what the most common sequence of genre ("Programme_Genre") is but I only want the rollapply function to do this across columns 1-4 for each row on each date. I'm sure I'm not explaining this very well (I don't come from a data science background in case you hadn't guessed) so I'm more than happy to elaborate if necessary. Thanks in advance!

Your sample data is great (it may look bad, but that's a good way to share it). What is your expected output? When what you want to do seems complex, it often helps significantly to take the sample data and manually determine what the output should look like. Even if you don't do it for all rows, doing it for 6 or so will help considerably. — r2evans, Apr 22 '21 at 17:44
Hello Japes. What do you mean by applying it to columns 1-4? Do you expect a result per Program_Genre for (1) 'Title', (2) Program_Genre, (3) Program Category, and (4) Date? That is, 4 outputs per Program Genre? Or do you want to group the values of Program Genre, Title, Program_Category, and Date and get one output out of every combination of the 4? — Nicolás Velasquez, Apr 22 '21 at 18:39
@r2evans It's hard to say what my expected output would be on the above as the sample is actually quite small but if I was looking at most common sequences of 4, then the combination of "sport" "sport" "sport" "sport" would appear once (in the data provided above - it appears in rows 5 to 8 of the data frame and is part of the same row "2"). Does that help? — Japes, Apr 22 '21 at 18:52
@NicolásVelásquez Sorry - two of my variables are called "column" and "row" which is probably creating some confusion! Let me try and explain it again: I want to search for the most common sequences of 4 genres (or 3, or 2, but let's say 4 for now). However, I only want to look for this sequence in individual "rows" (which are number 1-4 in the "column" variable). Is that any clearer? — Japes, Apr 22 '21 at 18:55
Japes, if it's hard for you to say what the output *should* be, then how do you know if any efforts are correct? Getting an output is one thing, getting a *correct* output is another altogether ... and getting an incorrect (but good-looking) output is misleading at best, silently-corrupt data otherwise. — r2evans, Apr 22 '21 at 19:19
I'm a bit confused by this reply. When you say "what should the output be?" do you mean the format of the output or the actual data (i.e. the most common combination of genres and the number of times that combination appears in the data)? If the latter, I don't quite see how I'm supposed to know that (it's also hard to discern just by looking at the sample data). My goal is simply to ask and find out: what is the most common combination of genre (and how many times do those combinations appear in the data). Apologies if I misunderstood your comment. — Japes, Apr 22 '21 at 19:45

Nicolás Velasquez · Answer 1 · 2021-04-22T22:32:05.760

With tidyverse, zoo and lubridate, try:

library(tidyverse)
library(zoo)
library(lubridate)

df %>% 
  mutate(date = lubridate::dmy(date)) %>% # Optional. Properly parses date as Date class. Makes sorting easier.
  filter(column <= 4) %>% # Step 1. Exclude observations with `column` values above 4.
  group_split(row, date) %>% # Step 2. Splits the DF into smaller DFs representing row and date groups.
  # Step 3 (below). Loops the solution to the previous question, gets a DF, and assigns the date and row signals to each observation.
  map_df(.x = . ,
         .f = ~(rollapply(data = .x$Programme_Genre , 3, c) %>% 
                  as_tibble() %>% 
                  mutate(date = unique(.x$date), row = unique(.x$row)))) %>% 
  group_by_all() %>% 
  tally() %>% 
  arrange(date, row, n)

    # A tibble: 26 x 6
# Groups:   V1, V2, V3, date [26]
   V1            V2            V3               date       row       n
   <chr>         <chr>         <chr>            <date>     <chr> <int>
 1 Documentary   Drama         New SeriesComedy 2018-08-13 1         1
 2 Entertainment Documentary   Drama            2018-08-13 1         1
 3 Sport         Sport         Sport            2018-08-13 2         2
 4 Drama         Entertainment Documentary      2018-08-13 3         1
 5 Sport         Drama         Entertainment    2018-08-13 3         1
 6 Comedy        Drama         Comedy           2018-08-13 4         1
 7 Drama         Comedy        Documentary      2018-08-13 4         1
 8 Crime Drama   Documentary   Documentary      2018-08-14 1         1
 9 Documentary   Documentary   Documentary      2018-08-14 1         1
10 Comedy        Drama         Comedy           2018-08-14 2         1
# ... with 16 more rows

Thanks Nicolás. This looks promising. I won't get a chance to test this until tomorrow but will report back once I've had a go! Thanks for taking the time to reply to this. — Japes, Apr 22 '21 at 19:46
Ok, finally got a chance to try this, but I'm getting the following error message: "Error: cannot arrange column of class 'function' at position 1". Any ideas? I've had a search for the answer but can't figure it out! — Japes, Apr 26 '21 at 12:30

AnilGoyal · Answer 2 · 2021-04-26T12:49:40.487

1

In this case also, I suggest you a similar strategy suggested in linked question.

Firstly load the libraries

library(tidyverse)
library(runner)

Strategy for say n=3

n <- 3

data %>% 
  group_by(date) %>%
  mutate(l_seq = runner(x = Programme_Genre, 
                        k = n, 
                        function(x) ifelse(length(x) == n, list(x), list(rep(NA, n)))
  )
  ) %>%
  ungroup() %>%
  group_split(date) %>%
  map_df(., ~ map_df(.x$l_seq, ~setNames(.x, paste0('Col', seq_len(n)))) %>%
           mutate(date = .x$date) %>% 
           na.omit() %>%
           group_by_all() %>%
           summarise(m = n(), .groups = 'drop') %>%
           filter(m == max(m) & m > 1)
  )

# A tibble: 2 x 5
  Col1   Col2   Col3   date           m
  <chr>  <chr>  <chr>  <chr>      <int>
1 Sport  Sport  Sport  13/08/2018     3
2 Sci-Fi Sci-Fi Sci-Fi 14/08/2018     2

Needless to say m is the column giving you maximum count of sequence on that particular date

say if n=4, the above syntax gives you following results

# A tibble: 1 x 6
  Col1  Col2  Col3  Col4  date           m
  <chr> <chr> <chr> <chr> <chr>      <int>
1 Sport Sport Sport Sport 13/08/2018     2

There is no sequence of length more than 1 for length 5 in the sample

edited Apr 26 '21 at 12:49

answered Apr 23 '21 at 06:15

AnilGoyal

25,297
4
27
45

Hi @AnilGoyal. Thanks for the help. I've tried the code but I get the following error: "Error: Argument 1 must have names". Any ideas? – Japes Apr 26 '21 at 12:37
have you created `cols` as shown in example. Or should I modify my code? – AnilGoyal Apr 26 '21 at 12:39
Hi, yes, I created `cols` as per your code. I don't want to take advantage if your generosity but if you did know the solution, that would be amazing :) – Japes Apr 26 '21 at 12:47
Thanks for trying - but still getting the same error! :( – Japes Apr 26 '21 at 12:58
Hi Anil, sorry, I'm not sure what you mean. Are you able to elaborate / explain it to me as if I'm totally new to R... which is not far from the truth! – Japes Apr 26 '21 at 14:02
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231614/discussion-between-anilgoyal-and-japes). – AnilGoyal Apr 26 '21 at 14:07
@Japes, I remember we had already Chat earlier also. Did this issue solve? or is still unsolved? – AnilGoyal Jul 07 '21 at 15:36
1

I honestly can't remember where I got with this. I think I managed to come up with a fix, but it wasn't perfect. However, at some point soon I'll be coming back to this particularly project so will update you when I know more :) – Japes Jul 08 '21 at 09:49

How can I find the most common sequences in my data using R?

2 Answers2

Linked