I have data with some columns stored as factors and others as characters. I want to count all combinations of their values and wrap this in a function written with data.table syntax.

# Load libraries

library(dplyr)
library(data.table)

# Create data

i_df = iris %>%
  filter(Species != 'virginica') %>%
  mutate(
    len   = ifelse(Sepal.Length > 6, 'large', 'tiny'),
    width = ifelse(Sepal.Width > 3, 'thick', 'thin'),
    color = ifelse(Species == 'setosa', 'green', 'red')
  ) %>% 
  mutate(
    len   = factor(len, levels = c('large', 'med_len', 'tiny')),
    width = factor(width, levels = c('thick', 'med_width', 'thin'))
  )

This would be an example of my function:

myfun = function(d, g, mode) {
  
  # Convert to data.table  
  setDT(d)
  
  # Counting
  res = d[, .N, by = g]
  
  # Complete combinations
  setkeyv(res, cols = g)
  
  res = switch(
    mode,
    manual = {
      res[CJ(levels(d$Species), levels(d$len), levels(d$width), unique(d$color)),]
    },
    auto = {
      m = res[, do.call(CJ, c(.SD, unique = TRUE)), .SDcols = g]
      res[m, on = g]
    }
  )
  
  # Add zero when NA
  res[is.na(res)] = 0
  
  # Return
  return(res)
  
}

How to run:

g_tmp = c('Species', 'len', 'width', 'color')

myfun(d = i_df, g = g_tmp, mode = 'manual')
myfun(d = i_df, g = g_tmp, mode = 'auto')

As you can see, I'm using setkeyv rather than setkey because I need to pass the grouping columns as a character vector g. However, I cannot get the CJ completion to work with that character vector (mode = 'auto'). The completion should use all factor levels for factor columns and the unique values present in the data for character columns. With mode = 'manual', 54 rows are returned; with mode = 'auto', factor levels that do not occur in the data are dropped and the result has only 16 rows.

I've found this answer and this one, but I cannot get either working when the columns are a mix of factor and character.

Because some columns are factors with levels that do not occur in the data, unique is not suitable for them; it only works for the character columns.
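To make the difference concrete, here is a minimal sketch in base R. The vector `x` is a made-up stand-in for the `len` column, not part of the original data; it only illustrates that unique() sees the observed values while levels() sees every declared level.

```r
# Hypothetical stand-in for the question's `len` column:
# "med_len" is declared as a level but never occurs in the data
x <- factor(c("large", "tiny", "tiny"),
            levels = c("large", "med_len", "tiny"))

# unique() only reports values that actually occur
observed <- as.character(unique(x))

# levels() reports every declared level, including the unused one
declared <- levels(x)

print(observed)
print(declared)
```

This is why building the CJ grid from unique values alone cannot produce rows for unused levels.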

Archymedes
  • Related: [Empty factors in "by" data.table](https://stackoverflow.com/questions/18866796/empty-factors-in-by-data-table#18866796) – Henrik Aug 16 '21 at 20:24
  • The "_Do not preserve column classes_" is unrelated to `by` dropping unused levels, but is a result of your next step: `table(i_dt[,..g]) %>% as.data.table()`. See also issues for a discussion: [Grouping could include unused factor levels while computing `j` (like `tapply` does)](https://github.com/Rdatatable/data.table/issues/562); [Should grouping by a factor always return a row for every level of the factor (no dropping missing levels)?](https://github.com/Rdatatable/data.table/issues/4421) – Henrik Aug 16 '21 at 20:56
  • Thanks, I saw those links before. With one of them, I discovered the base::table function – Archymedes Aug 16 '21 at 21:06
  • After thinking about it, you were right. I've simplified the question to the main one: how to write it as a general function – Archymedes Aug 17 '21 at 09:54
  • @Archymedes what do you want to do (exactly)? You described your problem but what you expect is not really clear (in my opinion). Why don't you add the expected output? – B. Christian Kamgang Aug 17 '21 at 11:04
  • @B. Christian Kamgang, I see. I've edited to explain the resulting problem. Entering manually all factor levels and unique, I have all combinations. With mode = 'auto', non-present factor levels are not completed – Archymedes Aug 17 '21 at 11:22
  • I have posted one possible solution. Let me know if it works. – B. Christian Kamgang Aug 17 '21 at 12:08

1 Answer

Here is one possible way to solve your problem. The argument with = FALSE tells data.table to select columns using the standard data.frame rules, so the character vector g can be used directly. In the example below, I assume that the columns used to compute all combinations are passed to myfun as a character vector. by = .EACHI evaluates j once for each row of i, so .N is 0 for combinations that do not occur in the data. Keep in mind that no column in your dataset should be named gcases.

myfun = function(d, g) {
  # get levels (for factors) and unique values for other types. 
  fn <- function(x) if(is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key=g)[, g, with=FALSE], fn)
  
  # count based on all combinations
  d[do.call(CJ, gcases), .N, keyby=.EACHI]
}
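As a quick check of the approach above, the following sketch rebuilds the question's data with data.table only and counts the combinations. The name `count_combos` and the copy() step are my own assumptions, not part of the answer: the copy is there so the caller's data is not keyed or converted by reference. The sketch also uses `by = .EACHI` rather than `keyby`.

```r
library(data.table)

# Rebuild the question's data without dplyr
i_dt <- as.data.table(iris)[Species != "virginica"]
i_dt[, len   := factor(ifelse(Sepal.Length > 6, "large", "tiny"),
                       levels = c("large", "med_len", "tiny"))]
i_dt[, width := factor(ifelse(Sepal.Width > 3, "thick", "thin"),
                       levels = c("thick", "med_width", "thin"))]
i_dt[, color := ifelse(Species == "setosa", "green", "red")]

# Hypothetical variant of the answer's function that works on a copy,
# so the input is neither keyed nor converted by reference
count_combos <- function(d, g) {
  d <- copy(as.data.table(d))
  # all levels for factors, observed values for everything else
  value_set <- function(x) if (is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(d[, g, with = FALSE], value_set)
  setkeyv(d, g)
  # one output row per combination in the CJ grid; .N is 0 when unmatched
  d[do.call(CJ, gcases), .N, by = .EACHI]
}

g_tmp <- c("Species", "len", "width", "color")
res <- count_combos(i_dt, g_tmp)

nrow(res)    # 3 * 3 * 3 * 2 = 54 combinations
sum(res$N)   # 100, the number of rows left after dropping virginica
```

The 54 rows match what mode = 'manual' returned in the question, with unused levels such as med_len and virginica present and counted as 0.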