How can I find most common sequences (of strings) in my data using Tableau?

Question

Usual disclaimer: I'm very much a novice when it comes to Tableau (and R, which is my preferred data wrangling language).

Here's what I'm trying to do:

I have a dataset which has multiple variables, two of which are "time" and "genre". Here's an example of what the data looks like:


Index      Title              Date       Genre             Time
    1      Sherlock           01/01/20   Drama             21:00
    2      Peaky Blinders     01/01/20   Drama             20:00
    3      Eastenders         01/01/20   Drama             19:30
    4      BBC News           01/01/20   News              18:30
    5      Antiques Roadshow  01/01/20   Factual           18:00
    6      Peaky Blinders     02/01/20   Drama             21:00
    7      Casualty           02/01/20   Drama             20:00
    8      Eastenders         02/01/20   Drama             19:30
    9      BBC News           02/01/20   News              18:30
   10      Dragons Den        02/01/20   Entertainment     18:00

This is just a very small sample from a very large dataset, but what I'm trying to determine is: what are the most common combination/sequence of genres? For example, in the data above, the most common sequence of three would be "drama + drama + drama". The most common sequence of four would be "news + drama + drama + drama".

My data has thousands of dates (it's the BBC One broadcast schedule in case you were wondering) and I want to find out what the most common combination / sequence of genres are (of at least 3).

I wonder if this is too complex for Tableau and something I would need to do in R instead? Any advice would be most welcome! And as always, I'm happy to elaborate on anything that isn't clear.

Hi Japes, questions about how to use Tableau to analyze data are not on topic for Stack Overflow, that is why the [tag:Tableau] [tag was nuked](https://meta.stackoverflow.com/questions/251957/tag-request-tableau-api). If you're interested in using a programming technique, I would suggest refocusing on R. — Ian Campbell, Apr 12 '21 at 19:40

score 1 · Answer 1 · answered Apr 12 '21 at 19:55

If you looking at large volumes of data where the order/sequence of data rows is critical to analysis, you would do well to learn about SQL windowing queries, regardless of the visualization tool you use.

The simplest approach would be to start with a pre-processing step to feed simplified data to your visualization layer, either using Python, R, possibly with Tableau-Prep or some other tool. Doing that is easier if you can very clearly define the information you need to display at the end and work backwards to determine what the pre-processing step should produce to enable that viz.

If preparing data up front is too static, doesn't allow you to interactively explore as desired, you can then pull some of the pre-processing into the interactive layer, perhaps using Tableau's custom SQL or pass-through functions to call some of the same SQL you developed in the pre-processing step.

In any case, windowing queries aka analytic queries may be useful for this.

Hey Alex, thanks for the feedback. I had a feeling this might be too complex for Tableau. As I don't use Prep (just haven't had time to learn it) I'll probably just do this in R. But as you note, it would be helpful to be able to do this stuff in real time! Thanks anyway for the feedback :) — Japes, Apr 13 '21 at 10:27

score 1 · Accepted Answer · answered Apr 12 '21 at 20:09

1

Since a Tableau approach would be off topic, let's consider R:

We can use a rolling function to determine all sequences of 3. The zoo package has a rollapply function:

library(zoo)
rollapply(data$Genre,3,c)
#     [,1]      [,2]      [,3]           
#[1,] "Drama"   "Drama"   "Drama"        
#[2,] "Drama"   "Drama"   "News"         
#[3,] "Drama"   "News"    "Factual"      
#[4,] "News"    "Factual" "Drama"        
#[5,] "Factual" "Drama"   "Drama"        
#[6,] "Drama"   "Drama"   "Drama"        
#[7,] "Drama"   "Drama"   "News"         
#[8,] "Drama"   "News"    "Entertainment"

There are plenty of ways to go from here, but I prefer dplyr:

library(dplyr)
rollapply(data$Genre,3,c) %>%
   as_tibble() %>%
   group_by_all() %>%
   tally()
#  V1      V2      V3                n
#  <chr>   <chr>   <chr>         <int>
#1 Drama   Drama   Drama             2
#2 Drama   Drama   News              2
#3 Drama   News    Entertainment     1
#4 Drama   News    Factual           1
#5 Factual Drama   Drama             1
#6 News    Factual Drama             1

Data:

data <- structure(list(Index = 1:10, Title = c("Sherlock", "Peaky Blinders", 
"Eastenders", "BBC News", "Antiques Roadshow", "Peaky Blinders", 
"Casualty", "Eastenders", "BBC News", "Dragons Den"), Date = c("01/01/20", 
"01/01/20", "01/01/20", "01/01/20", "01/01/20", "02/01/20", "02/01/20", 
"02/01/20", "02/01/20", "02/01/20"), Genre = c("Drama", "Drama", 
"Drama", "News", "Factual", "Drama", "Drama", "Drama", "News", 
"Entertainment"), Time = c("21:00", "20:00", "19:30", "18:30", 
"18:00", "21:00", "20:00", "19:30", "18:30", "18:00")), class = "data.frame", row.names = c(NA, 
-10L))

answered Apr 12 '21 at 20:09

Ian Campbell

23,484
14
36
57

I had a feeling that this would be too complex for Tableau, but this looks like a good solution. I'll try it out today/tomorrow and let you know how I get on. – Japes Apr 13 '21 at 10:31
@Ian, this is a very good solution, but what if all the sequences of length >= 3 are required – AnilGoyal Apr 13 '21 at 16:27
No, your solution strategy is perfect. I want just wondering if we can do it to find seq of max length possible i.e. in this case 4. – AnilGoyal Apr 13 '21 at 16:39
1

@AnilGoyal Ah, I see, yes, I worry that this solution is highly memory inefficient and will fail at large lengths. If the question author has a specific problem they run into, then I might think more. Otherwise, I'm happy to upvote your alternative solution that is more efficient. – Ian Campbell Apr 13 '21 at 16:41
1

happy to note that. But I am still figuring out a strategy to do it. :) – AnilGoyal Apr 13 '21 at 16:42
Ok, this worked perfectly, thanks! But I now how a more complex problem: is it possible to limit this function to particular dates? E.g. only search for sequences on the same date (rather than also across dates?) – Japes Apr 22 '21 at 16:49
1

Yes, you can use the `group_by` function on `Date`, but that sounds like a different question. – Ian Campbell Apr 22 '21 at 16:51
I'm applying the approach you suggested on another dataset (which is slightly different), and I need to group by date AND row number (there are four rows for each date, each row is numbered 1-4). Maybe I need to write a new question for this?! – Japes Apr 22 '21 at 16:56
1

Yes, as a new question. You can provide a link to this one for context if it helps. Remember to post a representative sample of your data (like you did here), your code thus far, and your expected output for the best chance of a helpful answer. – Ian Campbell Apr 22 '21 at 16:57
ok, done: https://stackoverflow.com/questions/67218140/how-can-i-find-the-most-common-sequences-in-my-data-using-r (hope it's clear!) – Japes Apr 22 '21 at 17:32

score 0 · Answer 3 · answered Apr 23 '21 at 05:37

To do it in r you may also use package runner. For checking it for variable lengths, use a strategy like this.

n <- 3
cols <- paste0('Col', seq_len(n))

library(runner)
map_dfr(runner(x=data$Genre, k=n, f = function(x) ifelse(length(x) == n, list(x), list(rep(NA, n)))), 
        ~ setNames(.x, cols)) %>%
  na.omit() %>%
  group_by_all() %>%
  summarise(m = n()) %>%
  ungroup() %>%
  filter(m == max(m))

# A tibble: 2 x 4
  Col1  Col2  Col3      m
  <chr> <chr> <chr> <int>
1 Drama Drama Drama     2
2 Drama Drama News      2

For length 4 just change n and

n <- 4
cols <- paste0('Col', seq_len(n))

map_dfr(runner(x=data$Genre, k=n, f = function(x) ifelse(length(x) == n, list(x), list(rep(NA, n)))), 
        ~ setNames(.x, cols)) %>%
  na.omit() %>%
  group_by_all() %>%
  summarise(m = n()) %>%
  ungroup() %>%
  filter(m == max(m))
# A tibble: 1 x 5
  Col1  Col2  Col3  Col4      m
  <chr> <chr> <chr> <chr> <int>
1 Drama Drama Drama News      2

Data used

> data
   Index             Title     Date         Genre  Time
1      1          Sherlock 01/01/20         Drama 21:00
2      2    Peaky Blinders 01/01/20         Drama 20:00
3      3        Eastenders 01/01/20         Drama 19:30
4      4          BBC News 01/01/20          News 18:30
5      5 Antiques Roadshow 01/01/20       Factual 18:00
6      6    Peaky Blinders 02/01/20         Drama 21:00
7      7          Casualty 02/01/20         Drama 20:00
8      8        Eastenders 02/01/20         Drama 19:30
9      9          BBC News 02/01/20          News 18:30
10    10       Dragons Den 02/01/20 Entertainment 18:00

How can I find most common sequences (of strings) in my data using Tableau?

3 Answers3

Linked