extracting unique combinations from a long list of binary variables

Question

I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sakes of simplicity, let's say there's only four binary variables and 6 participants.

df <- data.frame(a = c(0,1,0,1,0,1),
                 b = c(1,1,1,1,0,1),
                 c = c(0,0,0,1,1,1),
                 d = c(1,1,0,0,0,0))

>df

#   a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0

In the dataframe, I want to create a list of columns that reflect each unique combination of variables in df (i.e., abc, abd, bcd, cda). Then, for each row, I want to add value "1" if the row contains the particular combination corresponding to the column. So, if the participant scored 1 on "a", "b", and "c", and 0 on "d" he would have a score 1 in the newly created column "abc", but 0 in the other columns. Ideally, it would look something like this.

>df_updated

#   a b c d abc abd bcd cda
# 1 0 1 0 1   0   0   0   0
# 2 1 1 0 1   0   1   0   0
# 3 0 1 0 0   0   0   0   0
# 4 1 1 1 0   1   0   0   0
# 5 0 0 1 0   0   0   0   0
# 6 1 1 1 0   0   0   0   0

The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.

I'm not really sure how your expected output gives you any useful new information over the raw data, truth be told. You've turned 4 input columns into 4 output columns which are--in my opinion--even harder to understand than the raw data. What happens if you have 5, 6 or 10 input columns? The number of different output combinations of those fields will get large very quickly. — Simon, Mar 28 '21 at 11:02
To me this sounds that you maybe are looking for "item sets". If so, `apriori::arules` is an efficient tool, especially if you will have more than only combinations of three (as mentioned by @Simon). See e.g. [Count common sets of items between different customers](https://stackoverflow.com/a/63348925), where you already have a "binary incidence matrix". Adjust `support` and `minlen` as desired. — Henrik, Mar 28 '21 at 11:14
For more background, please check the nice [`arules` vignette](https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf) — Henrik, Mar 28 '21 at 11:25
Why don't you write up a solution based on this, would be nice to see — Sirius, Mar 28 '21 at 13:46
@Sirius Thanks, indeed, but _if_ OP really wants item sets, the question would be a duplicate of the link (and then no need to reiterate that answer here). And _if_ OP wants exactly what is described in the question, there are already some answers. Cheers — Henrik, Mar 28 '21 at 14:12

score 1 · Answer 1 · answered Mar 28 '21 at 12:21

Something like this?

funCombn <- function(data){
  f <- function(x, data){
    data <- data[x]
    list(
      name = paste(x, collapse = ""),
      vec = apply(data, 1, function(x) +all(as.logical(x)))
    )
  }
  
  res <- combn(names(df), 3, f, simplify = FALSE, data = df)
  out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
  names(out) <- sapply(res, '[[', 'name')
  cbind(data, out)
}

funCombn(df)
#  a b c d abc abd acd bcd
#1 0 1 0 1   0   0   0   0
#2 1 1 0 1   0   1   0   0
#3 0 1 0 0   0   0   0   0
#4 1 1 1 0   1   0   0   0
#5 0 0 1 0   0   0   0   0
#6 1 1 1 0   1   0   0   0

score 1 · Answer 2 · answered Mar 28 '21 at 13:32

Base R option using combn :

n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
  setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)), 
           paste0(x, collapse = ''))
}, simplify = FALSE))) -> result

result

#  a b c d abc abd acd bcd
#1 0 1 0 1   0   0   0   0
#2 1 1 0 1   0   1   0   0
#3 0 1 0 0   0   0   0   0
#4 1 1 1 0   1   0   0   0
#5 0 0 1 0   0   0   0   0
#6 1 1 1 0   1   0   0   0

Using combn create all combinations of column names taking n columns at a time. For each of those combinations assign 1 to those rows where all the 3 combinations are 1 or 0 otherwise.

score 0 · Answer 3 · answered Mar 28 '21 at 12:21

If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:

df <- data.frame(a = c(0,1,0,1,0,1),
                 b = c(1,1,1,1,0,1),
                 c = c(0,0,0,1,1,1),
                 d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x)ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
# abd   b  bd   c abc 
#   1   1   1   1   2

van Nijnatten · Answer 4 · 2021-03-28T13:30:26.673

Ok, so let's use your data, including one row without any 1's:

df <- data.frame(
  a = c(0,1,0,1,0,1,0),
  b = c(1,1,1,1,0,1,0),
  c = c(0,0,0,1,1,1,0),
  d = c(1,1,0,0,0,0,0)
)

Now I want to paste all column names together if they have a 1, and then make that a wide table (so that all have a column for a combination). Of course, I fill all resulting NAs with 0's.

df2 <- df %>%
  dplyr::mutate(
    combination = paste0(
      ifelse(a == 1, "a", ""),    # There is possibly a way to automate this as well using across()
      ifelse(b == 1, "b", ""),
      ifelse(c == 1, "c", ""),
      ifelse(d == 1, "d", "")
    ),
    combination = ifelse(
      combination == "",
      "nothing",
      paste0("comb_", combination)
    ),
    value = ifelse(
      is.na(combination),
      0,
      1
    ),
    i = dplyr::row_number()
  ) %>%
  tidyr::pivot_wider(
    names_from = combination,
    values_from = value,
    names_repair = "unique"
  ) %>%
  replace(., is.na(.), 0) %>%
  dplyr::select(-i)

Since you want to order the original df by frequency, you can create a summary of all combinations (excluding those without anything filled in). Then you just make it a long table and pull the column for every combination (arranged by frequency) from the table.

comb_in_order <- df2 %>%
  dplyr::select(
    -tidyselect::any_of(
      c(
        names(df),
        "nothing"     # I think you want these last.
      )
    )
  ) %>%
  dplyr::summarise(
    dplyr::across(
      .cols = tidyselect::everything(),
      .fns = sum
    )
  ) %>%
  tidyr::pivot_longer(
    cols = tidyselect::everything(),
    names_to = "combination",
    values_to = "frequency"
  ) %>%
  dplyr::arrange(
    dplyr::desc(frequency)
  ) %>%
  dplyr::pull(combination)

The only thing to do then is to reconstruct the original df by these after arranging by the columns.

df2 %>%
  dplyr::arrange(
    across(
      tidyselect::any_of(comb_in_order),
      desc
    )
  ) %>%
  dplyr::select(
    tidyselect::any_of(names(df))
  )

This should work for all possible combinations.

extracting unique combinations from a long list of binary variables

4 Answers4