I have a data.frame where each ID has exactly 3 attributes. For simplification I put only 100 rows, although in my real dataset it's around a 1.000.000. There are around 50 different possible attributes. The attributes are a mixture out of numbers and characters.
data <- data.frame(id = 1:100,
a1 = sample(letters,100,replace = T),
a2 = sample(letters,100,replace = T),
a3 = sample(letters,100,replace = T),
stringsAsFactors=FALSE) %>%
as_tibble()
I want to know what are the most frequent combinations (the order does not matter)
So the outcome is supposed to be something like this
pattern | frequency
a,a,a | 10
A,b,c | 5
a,e,c | 4
... | ....
First I started to create a vector which contains all possible combinations:
possible_combinations <- combn(c(letters,LETTERS),3) %>%
t() %>%
as_tibble() %>%
unite("combination",sep="") %>%
pull()
Then I wrote this nested loop to count the frequencies:
counter = 0
inner_counter = 0
combination_counter = vector(mode = "numeric",length = length (possible_combinations))
for (j in 1:length(possible_combinations)){
for (i in 1:nrow(data)){
# inner Counter Counts when Attribute of one ID is in one combination
inner_counter = inner_counter + str_count(possible_combinations[j] , data[[i,2]] )
inner_counter = inner_counter + str_count(possible_combinations[j] , data[[i,3]] )
inner_counter = inner_counter + str_count(possible_combinations[j] , data[[i,4]] )
# if all three attributes are in a combination, then the Counter increases by one
if(inner_counter == 3) {
counter = counter + 1 }
inner_counter = 0
}
# combination_counter is a vector which saves the frequency with
# which a combination ocurred in all different ids
combination_counter[[j]] = inner_counter
inner_counter = 0
}
I know this is really not very R like, but I don't know how to do it in a different way. The runtime is even bad for my little toy example and it's almost infeasible for my real data.