
I have a large data.table and have been trying to return a list or vector of counts of specific default values in each column (the defaults vary per column). It is organized as follows:

library(data.table)
set.seed(1)
DT = as.data.table(matrix(sample(1:100, 100*100, TRUE), 100, 100))
# schematic of the data layout (values are illustrative)
param1  param2  param3  ... param100   # column names
1       1       1       ... 1          # first row holds the default values
.                                      # remaining rows are arbitrary:
.                                      # within a column, a param may keep
1       666     1       ... 143        # its default or take another value
.
.
10000   1       1       ... 420

What is a data.table way of doing this? I have been sifting through the documentation and am trying to avoid for loops and approaches that are memory- or compute-intensive (I've tried filter, lapply, and grouping, without luck).

An analogous example of what I am looking for is counting the number of non-NA values per column:

count <- colSums(!is.na(DT))
#which outputs the following:
print(count)
param1  param2  param3  ... param100
1       292     0           7

Is there a way to do this similar to the colSums(!is.na(DT)) approach, but for a default value specific to each column? Instead of counting non-NA values per column, I want to count the non-default values in each column of my DT, where each column's default value sits in its first row.

  • Consider making a reproducible example, maybe like `set.seed(1); DT = as.data.table(matrix(sample(1:100, 100*100, TRUE), 100, 100))`? It's not really clear what you mean. http://stackoverflow.com/a/28481250/ – Frank Aug 20 '16 at 00:12
  • @Frank noted, I will edit accordingly. – freudiandrip Aug 20 '16 at 21:21

2 Answers

7

Well, I think this is what's meant:

library(data.table)
set.seed(1)
DT = as.data.table(matrix(sample(1:1e5, 1e8, TRUE), ncol = 10))  # 1e7 rows x 10 columns

# vector scan: compare each column to its first (default) value and count matches
sapply(DT, function(x) sum(x == x[1L]))
#  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 
# 124 100 111 101 113 101  94 108  79 112 

# binary search: build the expression V1 == V1[1L] for each column and
# evaluate it in i, so data.table can use its (auto-built) index
sapply(names(DT), function(x){
    q = substitute(x == x[1L], list(x=as.symbol(x)))
    DT[eval(q), .N]
})
#  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 
# 124 100 111 101 113 101  94 108  79 112 

Benchmarks (a timing sketch follows this list):

  • The vector scan takes 0.22 seconds.
  • The binary search takes 1.8 seconds the first time, while the indices are built.
  • Every binary search after that is essentially instant.
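
For reference, the timings above could be reproduced with something like the following (a sketch; exact numbers will vary by machine and data.table version):

# time the vector scan
system.time(sapply(DT, function(x) sum(x == x[1L])))

# time the binary search; the first run includes building the indices
system.time(sapply(names(DT), function(x){
    q = substitute(x == x[1L], list(x=as.symbol(x)))
    DT[eval(q), .N]
}))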

The speedup is thanks to indices on the data.table, more specifically auto-indexing. See ?indices and read through the vignettes (the topic is covered in the fourth one). Indices will speed up other operations on the data as well. To create them from the start, do for (nm in names(DT)) setindexv(DT, nm), though of course this costs the same ~1.8 seconds.
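
For example, to create the indices ahead of time and confirm they are attached (a sketch; setindexv() and indices() are available in recent data.table versions):

# build an index on every column up front, then list them
for (nm in names(DT)) setindexv(DT, nm)
indices(DT)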

Note: If your data are floating-point, you will run into trouble no matter how you approach this. Floats don't play nice with equality tests like x==x[1L].
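
If the columns were numeric rather than integer, one workaround (a sketch; the tolerance would need to suit the data) is to count values within a small tolerance of the default instead of testing exact equality:

# count entries within a tolerance of each column's first (default) value
tol = 1e-8
sapply(DT, function(x) sum(abs(x - x[1L]) < tol))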

Frank
2

We can use summarise_each from dplyr:

library(dplyr)
DT %>% 
    summarise_each(funs(sum(.==first(.))))

With @Frank's example data, it completes in about half a second:

system.time(DT %>% 
     summarise_each(funs(sum(.==first(.)))))
#  user  system elapsed 
#  0.38    0.16    0.53 
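
Note that summarise_each has since been deprecated in dplyr; an equivalent using across() (a sketch, not benchmarked) would be:

library(dplyr)
DT %>%
    summarise(across(everything(), ~ sum(.x == first(.x))))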
akrun