Suppose we have a data frame like this:

dat <- data.frame(
    a = rnorm(1000),
    b = 1/(rnorm(1000))^2,
    c = 1/rnorm(1000),
    d = as.factor(sample(c(0, 1, 2), 1000, replace=TRUE)),
    e = as.factor(sample(c('X', 'Y'), 1000, replace=TRUE))
)

We would like to compute a histogram of this data across all dimensions (i.e. a, b, c, d, e), with specified breaks in each numeric dimension. Factor dimensions already imply their breaks through their levels. The result should look like a data.frame in which each row holds one combination of bins across all dimensions together with the number of observations that fall into that combination. Python's NumPy has histogramdd (see "Multidimension histogram in python"). Is there something similar in R? What is the best way to do this in R? Thank you.

I ended up using the following, where bin counts are passed to the function as the last row:

dat <- data.frame(
    a = rnorm(1000),
    b = 1/(rnorm(1000))^2,
    c = 1/rnorm(1000),
    d = as.factor(sample(c(0, 1, 2), 1000, replace=TRUE)),
    e = as.factor(sample(c('X', 'Y'), 1000, replace=TRUE))
)

dat[nrow(dat)+1,] <- c(10,10,10,NaN,NaN)

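histnd <- function(df) {
  res <- lapply(df, function(x) {
    # The last row holds the requested bin count for each numeric column
    bin_idx <- length(x)
    if (is.factor(x) || is.character(x)) {
      return(x[-bin_idx])
    }
    vals <- x[-bin_idx]
    # Equal-width breaks spanning the observed range
    breaks <- seq(min(vals), max(vals), length.out = x[bin_idx] + 1)
    # include.lowest = TRUE keeps the minimum, which cut() would otherwise map to NA
    cut(vals, breaks, include.lowest = TRUE)
  })
  res <- do.call(data.frame, res)
  res$FR <- 0
  # Count rows per observed combination of bins
  aggregate(FR ~ ., res, length)
}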

h <- histnd(dat)
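
A note on the binning step, as a minimal sketch with made-up values: cut() builds right-closed intervals (lo, hi] by default, so when the first break equals the minimum of the data, that minimum is assigned NA unless include.lowest = TRUE is passed. The formula method of aggregate() omits rows containing NA by default, so such a row would silently vanish from the counts.

```r
# Toy data (illustrative only, not from the question)
x <- c(1, 2, 3, 4)
breaks <- seq(min(x), max(x), length.out = 4)  # three equal-width bins

# Default: intervals are (lo, hi], so the minimum falls outside every bin
open_low <- cut(x, breaks)

# include.lowest = TRUE closes the first interval on the left
closed_low <- cut(x, breaks, include.lowest = TRUE)

print(data.frame(x, open_low, closed_low))
```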
Dimon

1 Answer

I have no idea what the expected result is, but this should provide a starting point:

histnd <- function(DF) {
  res <- lapply(DF, function(x) {
    if (is.factor(x) || is.character(x)) return(x)
    breaks <- pretty(range(x), n = nclass.Sturges(x), min.n = 1)
    cut(x, breaks)
    })
  res <- do.call(data.frame, res)
  as.data.frame(table(res))
}

h <- histnd(dat)
Roland
  • Looks good, except that the breaks (or bin count) have to be supplied by the user for each numeric dimension. What's '-1' for? Your function seems to return what was asked for: each row is a vector of breaks across all dimensions (a combination of breaks) plus the data occurrence count for that combination. – Dimon Sep 16 '15 at 19:04
  • Another thing is that 'table(res)' explodes the data size. Is there a way to do this for non-zero counts only? Yes, you can certainly do 'h <- h[h$Freq>0,]', but it would still result in a high spike in memory usage because of 'table(res)'. – Dimon Sep 16 '15 at 19:24
  • This could be improved if it was known what exactly the desired result is. E.g., I think melting with `length` as the aggregation function could be an option. It should be easy for you to adjust the function so that you can supply breaks manually. Take a look at the code of `hist.default` for an example of a rather comprehensive approach. – Roland Sep 17 '15 at 06:59
  • Ok thanks. Yes it is certainly not great for large data sets. For a DF with 10 columns and 10 breaks for each column I'm getting: Error in table(res) : attempt to make a table with >= 2^31 elements – Dimon Sep 17 '15 at 21:36
  • Ok thanks for your help. I posted modified function above – Dimon Sep 17 '15 at 23:37
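
The two concerns raised in the comments (user-supplied breaks, and the memory blow-up from table(res)) could be handled together along the following lines. This is only a sketch, not code from the thread: the helper name histnd_sparse and its breaks argument are hypothetical. It assumes breaks is a named list with one entry per numeric column, each entry being either a bin count or a vector of break points as accepted by cut(). Rows are counted by a pasted key, so only combinations that actually occur are materialised instead of the full cross product of levels.

```r
# Hypothetical helper (not from the thread): n-dimensional histogram that
# stores only the bin combinations that occur in the data.
histnd_sparse <- function(df, breaks) {
  binned <- lapply(names(df), function(nm) {
    x <- df[[nm]]
    if (is.factor(x) || is.character(x)) return(as.factor(x))
    # breaks[[nm]]: a bin count or a vector of break points, as in cut().
    # Values outside explicit break points become NA and form their own group.
    cut(x, breaks[[nm]], include.lowest = TRUE)
  })
  names(binned) <- names(df)
  binned <- as.data.frame(binned)

  # One string key per row; assumes bin labels do not contain "|"
  key <- do.call(paste, c(unname(as.list(binned)), sep = "|"))
  first <- !duplicated(key)

  # Keep one row per observed combination and count its occurrences
  out <- binned[first, , drop = FALSE]
  out$Freq <- tabulate(match(key, key[first]), nbins = sum(first))
  rownames(out) <- NULL
  out
}

# Usage on a small made-up data frame
set.seed(42)
toy <- data.frame(
  a = rnorm(200),
  b = runif(200),
  e = factor(sample(c("X", "Y"), 200, replace = TRUE))
)
h <- histnd_sparse(toy, list(a = 5, b = c(0, 0.25, 0.5, 1)))
print(head(h))
```

The memory used here grows with the number of rows rather than with the product of the bin counts, which is what triggered the ">= 2^31 elements" error with table(). The trade-off is that empty combinations are absent from the result rather than listed with a zero count.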