3

I have a data.frame where each ID has exactly 3 attributes. For simplification I put only 100 rows, although in my real dataset it's around a 1.000.000. There are around 50 different possible attributes. The attributes are a mixture out of numbers and characters.

data <- data.frame(id = 1:100,
               a1 = sample(letters,100,replace = T),
               a2 = sample(letters,100,replace = T),
               a3 = sample(letters,100,replace = T),
               stringsAsFactors=FALSE) %>% 
               as_tibble()

I want to know what are the most frequent combinations (the order does not matter)

So the outcome is supposed to be something like this

pattern | frequency
a,a,a   |  10
A,b,c   |  5
a,e,c   |  4
...     |  ....

First I started to create a vector which contains all possible combinations:

possible_combinations <- combn(c(letters,LETTERS),3) %>% 
   t() %>% 
   as_tibble() %>%
   unite("combination",sep="") %>% 
   pull()

Then I wrote this nested loop to count the frequencies:

 counter = 0
 inner_counter = 0
 combination_counter = vector(mode = "numeric",length = length (possible_combinations))

  for (j in 1:length(possible_combinations)){
    for (i in 1:nrow(data)){

        # inner Counter Counts when Attribute of one ID is in one combination
        inner_counter = inner_counter + str_count(possible_combinations[j] , data[[i,2]] )
        inner_counter = inner_counter + str_count(possible_combinations[j] , data[[i,3]] )
        inner_counter = inner_counter + str_count(possible_combinations[j] , data[[i,4]] )

      # if all three attributes are in a combination, then the Counter increases by one 
    if(inner_counter == 3) {
       counter = counter + 1 }
       inner_counter = 0
                            }

  # combination_counter is a vector which saves the frequency with 
  # which a combination ocurred in all different ids

  combination_counter[[j]] = inner_counter
  inner_counter = 0 
 }

I know this is really not very R like, but I don't know how to do it in a different way. The runtime is even bad for my little toy example and it's almost infeasible for my real data.

MatzeKnop
  • 163
  • 1
  • 8
  • So just to clarify, 'aab' = 'aba'? What about 'Aba'? – c.custer Nov 27 '18 at 16:47
  • How many unique elements do you have in your real data? For example, with your example data, there are exactly 26 (i.e. `length(unique(c(data$a1, data$a2, data$a3)))` returns `26`). – Joseph Wood Nov 28 '18 at 02:03
  • in my real data I have 53 unqiue elements, but some arguments occur much more frequently than others. I just did it with Lennyyys approach and it resulted in around 8000 combinations and it was relatively fast – MatzeKnop Nov 28 '18 at 11:00

4 Answers4

2

You could as well do this with base r:

table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ",")))
Lennyy
  • 5,932
  • 2
  • 10
  • 23
2

The problem that you are going to run into is dealing with the massive number of combinations. Even if you try to apply a simple solution of sorting each row, this will cost a lot of time for the number of rows you are dealing with.

Take the following example with the straightforward approach offered by @Lennyy:

set.seed(123)
n <- 1e7

data <- data.frame(id = 1:n,
                   a1 = sample(letters, n, replace = T),
                   a2 = sample(letters, n, replace = T),
                   a3 = sample(letters, n, replace = T),
                   stringsAsFactors = FALSE)

system.time(t2 <- table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ","))))
   user  system elapsed 
373.281   1.695 375.445

That's a long time...

Here is the output for reference:

head(t2)

a,a,a a,a,b a,a,c a,a,d a,a,e a,a,f 
  603  1657  1620  1682  1759  1734

We need to somehow code each row quickly without worrying about which column a particular element came from. Additionally, we need to do this in a way that will guarantee uniqueness.

What about a hash table? We can easily do this with Rcpp.

#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
IntegerVector countCombos(IntegerMatrix myMat, int numAttr, CharacterVector myAttr) {

    unsigned long int numRows = myMat.nrow();
    unsigned long int numCols = myMat.ncol();
    std::unordered_map<std::string, int> mapOfVecs;

    for (std::size_t i = 0; i < numRows; ++i) {
        std::vector<int> testVec(numAttr, 0);

        for (std::size_t j = 0; j < numCols; ++j) {
            ++testVec[myMat(i, j) - 1];
        }

        std::string myKey(testVec.begin(), testVec.end());

        auto it = mapOfVecs.find(myKey);

        if (it == mapOfVecs.end()) {
            mapOfVecs.insert({myKey, 1});
        } else {
            ++(it->second);
        }
    }

    std::size_t count = 0;
    IntegerVector out(mapOfVecs.size());
    CharacterVector myNames(mapOfVecs.size());

    for (const auto& elem: mapOfVecs) {
        std::size_t i = 0;
        for (auto myChar: elem.first) {
            while (myChar) {
                myNames[count] += myAttr[i];
                --myChar;
            }
            ++i;
        }
        out[count++] = elem.second;
    }

    out.attr("names") = myNames;

    return out;
}

This offers a great efficiency gain over any of the other solutions posted:

myRows <- 1:nrow(data)
attrCount <- 26

matOfInts <- vapply(2:ncol(data), function(x) {
    match(data[, x], letters)
}, myRows, USE.NAMES = FALSE)
system.time(t <- countCombos(matOfInts, attrCount, letters))
 user  system elapsed 
2.570   0.007   2.579

That's over 100 times faster!!!!

And here is the output:

head(t)
 jkk  ddd  qvv  ttu  aaq  ccd 
1710  563 1672 1663 1731 1775

Testing equality (the output is in different order, so we must sort first):

identical(sort(unname(t)), as.integer(sort(unname(t2))))
[1] TRUE

Explanation

The countCombos function accepts a matrix of integers. This matrix represents the indices of elements of the unique attributes (in our example, this would be represented by letters).

As we are dealing with combinations with repetition, we can easily represent them as an indexing frequency vector.

The template vector is:

 a   b   c   d   e       y   z
 |   |   |   |   |       |   |
 v   v   v   v   v       v   v
(0,  0,  0,  0,  0, ...  0,  0)

And here is how certain combinations get mapped:

aaa -->> (3, rep(0, 25))
zdd -->> dzd -->> ddz -->> (0, 0, 0, 2, rep(0, 21), 1)

Once we have created our vector, we convert it to a string, so ddz becomes:

ddz --> c((0,0,0,2, rep(0, 21),1) -->> `00020000000000000000000001`

And this is the key that is used in our hash.

Joseph Wood
  • 7,077
  • 2
  • 30
  • 65
  • Hey Joseph thanks for the detailed answer! How do I load the function?When I load the Rcpp package and run it, it throws out an error. – MatzeKnop Nov 28 '18 at 09:05
  • @MatzeKnop, if you are using RStudio, go to File —- New File —— C++ File. Put `countCombos` in that file (along with the #includes and // [[Rcpp..). Source that and you should be good to go. If you are in Windows, you will need Rtools. There is more info here http://adv-r.had.co.nz/Rcpp.html – Joseph Wood Nov 28 '18 at 13:02
1

If I've understood you correctly the ordering of the attributes doesn't matter, so aba is the same as aab and baa. You also have 50 different attributes and all other solutions seems to rely on typing these in manually.

The following code creates a column that is the concatenated of all attribute columns, sorts it to ignore the order of the attributes, and the calculates the count per group:

library(dplyr)
library(rlang)
cnames <- colnames(data)
cnames <- cnames[2:length(cnames)] #assuming the first column is the only non-attribute column,
#remove any other non-attribute columns as necessary

#!!!syms(cnames) outputs them as the columns rather than text, taken from here
# https://stackoverflow.com/questions/44613279/dplyr-concat-columns-stored-in-variable-mutate-and-non-standard-evaluation?rq=1
data %>% 
  mutate(comb = sort(paste0(!!!syms(cnames)))) %>% 
  group_by(comb) %>% 
  summarise(cnt = n())
Ryan Hill
  • 136
  • 1
  • 8
  • When testing this on the data set I provided in my answers, there should only be 3276 combinations, but your code returns `26^3 = 17576`. – Joseph Wood Nov 28 '18 at 22:21
  • Also when testing on a smaller data set (as is in the OP's question.. although tough to replicate because the seed hasn't been set), the results are not correct. This looks like a promising solution, but I don't know enough about `dplyr` to offer any advice. I hope this helps. – Joseph Wood Nov 28 '18 at 22:28
0

You can use dplyr to do this efficiently. First use group_by to group variables a1, a2, and a3, then use summarize and n() to count frequencies:

set.seed(100)
N = 1e5
data <- data.frame(id = 1:N,
                   a1 = sample(letters[1:5],N,replace = T),
                   a2 = sample(letters[1:5],N,replace = T),
                   a3 = sample(letters[1:5],N,replace = T),
                   stringsAsFactors=FALSE)
data %>%
  group_by(a1, a2, a3) %>%
  summarize(count = n()) %>%
  arrange(count)

## A tibble: 125 x 4
## Groups:   a1, a2 [25]
#   a1    a2    a3    count
#   <chr> <chr> <chr> <int>
# 1 b     a     d       735
# 2 c     b     d       741
# 3 a     d     e       747
# 4 d     a     e       754
# 5 d     e     e       754
# 6 d     e     c       756
# 7 e     a     d       756
# 8 d     c     d       757
# 9 c     c     c       758
#10 d     a     b       759
## ... with 115 more rows
tkmckenzie
  • 1,353
  • 1
  • 10
  • 19
  • 1
    This also doesn't appear to work in regards to 'order doesn't matter'. My simple test run counted 'aab' separately from 'baa' – c.custer Nov 27 '18 at 16:52
  • thanks for your answer. But c.custer is right the order does not matter. 'aab' is supposed to be the same as 'baa' – MatzeKnop Nov 28 '18 at 09:09