"multinomial expansion" of a dataset in R

Question

Given a dataset with an arbitrary number of columns N and rows T, I would like to obtain all the columns implied by a multinomial expansion of the sum of the columns raised to an arbitrary degree d.

To be clearer, take the following dataset as an example, with N = 3 and T = 10 and column names a, b, c:

set.seed(123)
ds <- cbind("a"=rnorm(10),"b"=rnorm(10),"c"=rnorm(10)); ds 

> ds
                a          b          c
 [1,] -0.56047565  1.2240818 -1.0678237
 [2,] -0.23017749  0.3598138 -0.2179749
 [3,]  1.55870831  0.4007715 -1.0260044
 [4,]  0.07050839  0.1106827 -0.7288912
 [5,]  0.12928774 -0.5558411 -0.6250393
 [6,]  1.71506499  1.7869131 -1.6866933
 [7,]  0.46091621  0.4978505  0.8377870
 [8,] -1.26506123 -1.9666172  0.1533731
 [9,] -0.68685285  0.7013559 -1.1381369
[10,] -0.44566197 -0.4727914  1.2538149

the desired output for degree d = 2 would be a dataset with columns { a^2, b^2, c^2, a * b, a * c, b * c }, which in this case I can manually specify as

out <- cbind(ds[,"a"]^2, ds[,"b"]^2, ds[,"c"]^2, ds[,"a"]*ds[,"b"], ds[,"a"]*ds[,"c"], ds[,"b"]*ds[,"c"])

I am wondering what smart ways are out there to perform this automatically, maybe with a function that only takes ds and d as arguments.

EDIT: as the MWE suggests, I am not really interested in the multinomial coefficients, so feel free to consider them for completeness or not.

Is this ignoring the multinomial coefficients then? – Nir Graham Jun 16 '23 at 10:52
From the output by OP that appears to be the case. – mhovd Jun 16 '23 at 10:53
You are right, I will make it clearer with an edit. – oibaFox Jun 16 '23 at 13:51

Onyambu · Answer 1 · 2023-06-16T11:35:51.647

In base R the following would suffice:

poly(ds, degree = 2, raw = TRUE)[,]
            1.0.0       2.0.0      0.1.0       1.1.0      0.2.0      0.0.1       1.0.1       0.1.1      0.0.2
 [1,] -0.56047565 0.314132950  1.2240818 -0.68606804 1.49837625 -1.0678237  0.59848918 -1.30710356 1.14024747
 [2,] -0.23017749 0.052981677  0.3598138 -0.08282104 0.12946599 -0.2179749  0.05017292 -0.07843039 0.04751306
 [3,]  1.55870831 2.429571609  0.4007715  0.62468579 0.16061776 -1.0260044 -1.59924166 -0.41119329 1.05268513
 [4,]  0.07050839 0.004971433  0.1106827  0.00780406 0.01225066 -0.7288912 -0.05139295 -0.08067566 0.53128242
 [5,]  0.12928774 0.016715318 -0.5558411 -0.07186344 0.30895937 -0.6250393 -0.08080991  0.34742254 0.39067409
 [6,]  1.71506499 2.941447909  1.7869131  3.06467216 3.19305856 -1.6866933 -2.89278864 -3.01397443 2.84493432
 [7,]  0.46091621 0.212443749  0.4978505  0.22946735 0.24785510  0.8377870  0.38614963  0.41709268 0.70188713
 [8,] -1.26506123 1.600379927 -1.9666172  2.48789113 3.86758304  0.1533731 -0.19402639 -0.30162620 0.02352331
 [9,] -0.68685285 0.471766840  0.7013559 -0.48172830 0.49190010 -1.1381369  0.78173260 -0.79823906 1.29535569
[10,] -0.44566197 0.198614592 -0.4727914  0.21070515 0.22353172  1.2538149 -0.55877763 -0.59279292 1.57205186

Note that the column names show the degree of each variable, i.e. 1.0.0 = a, 2.0.0 = a^2, 1.1.0 = a*b, etc.
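
Since poly() returns every term up to the requested degree (so a, b and c themselves are included), the columns of exactly degree d can be picked out afterwards. A minimal sketch, assuming the "degree" attribute that poly() attaches to its multi-column result; the `[,]` used above strips that attribute, so it is left off here. `exact_degree` is a hypothetical helper name, not part of base R:

```r
set.seed(123)
ds <- cbind(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# hypothetical helper: keep only the terms whose total degree equals `degree`
exact_degree <- function(x, degree) {
  p <- poly(x, degree = degree, raw = TRUE)  # all terms of total degree 1..degree
  keep <- attr(p, "degree") == degree        # total degree of each output column
  unclass(p)[, keep, drop = FALSE]           # plain matrix with the selected columns
}

colnames(exact_degree(ds, 2))
```

For d = 2 this should leave the six columns listed in the question, still named in the 2.0.0 style.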

You could of course create a small function to change the names accordingly:

namedPoly <- function(d, degree){
  x <- poly(d, degree = degree, raw = TRUE)[,]
  nms <- colnames(d)
  a <- t(read.table(text=colnames(x), sep='.'))
  b <- ifelse(a==0, "", ifelse(a==1, nms, paste0(nms, "^", a)))
  colnames(x) <- apply(b, 2, \(y)paste(y[nzchar(y)], collapse = "*"))
  x
}
 
namedPoly(ds, 3)

The poly function seems to be exactly what I was looking for. It works perfectly for a small number of columns, but with a 26-column dataset it already stops working at degree 2, failing with `Error: cannot allocate vector of size 9469.2 Gb`. This looks strange to me, since we are only talking about roughly 350 columns of output. Do you have an idea of why, and of a solution? – oibaFox Jun 17 '23 at 17:27
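
A likely explanation (an assumption from reading how poly() handles several columns, not something confirmed in this thread): it first enumerates every exponent combination from 0 to degree for each column, i.e. (degree + 1)^N rows, and only afterwards discards the rows whose total exceeds degree. For N = 26 and degree = 2 that is 3^26, about 2.5e12 rows, which is consistent with the size in the error message. Below is a minimal sketch that avoids the grid by enumerating the choose(N + d - 1, d) index multisets directly with combn(). `monomials` is a hypothetical helper, not part of any package, and it ignores the multinomial coefficients:

```r
# hypothetical helper: all degree-d monomials of the columns of x
monomials <- function(x, d) {
  n <- ncol(x)
  nms <- colnames(x)
  if (is.null(nms)) nms <- paste0("V", seq_len(n))
  # combn() picks d distinct values from 1..(n + d - 1); subtracting 0..(d - 1)
  # row-wise turns each pick into a non-decreasing vector of column indices,
  # i.e. a choice of d columns with repetition allowed
  idx <- combn(n + d - 1L, d) - (seq_len(d) - 1L)
  out <- matrix(1, nrow(x), ncol(idx))
  for (r in seq_len(d)) {
    # multiply in the r-th factor of every monomial at once
    out <- out * x[, idx[r, ], drop = FALSE]
  }
  colnames(out) <- apply(idx, 2L, function(j) paste(nms[j], collapse = "*"))
  out
}

set.seed(123)
ds <- cbind(a = rnorm(10), b = rnorm(10), c = rnorm(10))
colnames(monomials(ds, 2))
# "a*a" "a*b" "a*c" "b*b" "b*c" "c*c"

big <- matrix(rnorm(10 * 26), 10, 26)
dim(monomials(big, 2))
# 10 351
```

The column names here spell out repeated factors ("a*a" rather than "a^2") to keep the sketch short; memory use grows with the number of output columns only.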

score 1 · Answer 2 · answered Jun 16 '23 at 10:52

One can get the polynomial x^2+y^2+z^2+xy+yz+zx with spray::homog(3, power = 2).

Unfortunately, as far as I could find, the spray package has no function to extract the individual terms of a polynomial (a "spray"), so I wrote one myself. We also need each term as a character string ("a^2", "a*b", etc.), so I wrote a function for that too. The mpoly or mvp packages might provide such functions; I didn't check.

Finally, spray's as.function converts a polynomial to a function, so we have everything needed.

set.seed(123)
ds <- cbind("a" = rnorm(10), "b" = rnorm(10), "c" = rnorm(10))

library(spray)
P <- homog(ncol(ds), power = 2)

# get a polynomial term like  xy  as  "a*b"
as_character_term <- function(trm) {
  ops <- options(polyform = TRUE, sprayvars = colnames(ds))
  string <- capture.output(print_spray_polyform(trm))
  options(ops)
  substring(string, 2L)
}

# make list of terms of polynomial
terms <- function(P) {
  exponents <- index(P)
  coefficients <- coeffs(P)
  out <- lapply(1L:length(P), function(i) {
    as.spray(list(exponents[i, , drop = FALSE], coefficients[i]))
  })
  names(out) <- lapply(out, as_character_term)
  out
}

sapply(terms(P), function(trm) {
  f <- as.function(trm)
  f(ds)
})

#              c^2         b*c         a*c        b^2         a*b         a^2
#  [1,] 1.14024747 -1.30710356  0.59848918 1.49837625 -0.68606804 0.314132950
#  [2,] 0.04751306 -0.07843039  0.05017292 0.12946599 -0.08282104 0.052981677
#  [3,] 1.05268513 -0.41119329 -1.59924166 0.16061776  0.62468579 2.429571609
#  [4,] 0.53128242 -0.08067566 -0.05139295 0.01225066  0.00780406 0.004971433
#  [5,] 0.39067409  0.34742254 -0.08080991 0.30895937 -0.07186344 0.016715318
#  [6,] 2.84493432 -3.01397443 -2.89278864 3.19305856  3.06467216 2.941447909
#  [7,] 0.70188713  0.41709268  0.38614963 0.24785510  0.22946735 0.212443749
#  [8,] 0.02352331 -0.30162620 -0.19402639 3.86758304  2.48789113 1.600379927
#  [9,] 1.29535569 -0.79823906  0.78173260 0.49190010 -0.48172830 0.471766840
# [10,] 1.57205186 -0.59279292 -0.55877763 0.22353172  0.21070515 0.198614592

score 1 · Answer 3 · answered Jun 16 '23 at 10:54

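For each degree from d down to 1, this loop builds a matrix with columns a^d, b^d, c^d and, when d > 1, (a*b)^(d-1), (a*c)^(d-1), (b*c)^(d-1). Note that it is hard-coded for the three columns a, b, c, and that these columns coincide with the full set of degree-d monomials only for d <= 2.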

set.seed(123)
ds <- cbind("a" = rnorm(10), "b" = rnorm(10), "c" = rnorm(10))
d <- 2
out <- numeric()

while (d > 0) {
  out <- cbind(ds[, "a"]^d, ds[, "b"]^d, ds[, "c"]^d)
  if (d > 1) {
    out <- cbind(out, (ds[, "a"] * ds[, "b"])^(d - 1), (ds[, "a"] * ds[, "c"])^(d - 1), (ds[, "b"] * ds[, "c"])^(d - 1))
  }
  print(out)
  d <- d - 1
}

Hence, the matrix for d=2 would be the first one, and the matrix for d=1 would be the second one.

             [,1]       [,2]       [,3]        [,4]        [,5]        [,6]
 [1,] 0.314132950 1.49837625 1.14024747 -0.68606804  0.59848918 -1.30710356
 [2,] 0.052981677 0.12946599 0.04751306 -0.08282104  0.05017292 -0.07843039
 [3,] 2.429571609 0.16061776 1.05268513  0.62468579 -1.59924166 -0.41119329
 [4,] 0.004971433 0.01225066 0.53128242  0.00780406 -0.05139295 -0.08067566
 [5,] 0.016715318 0.30895937 0.39067409 -0.07186344 -0.08080991  0.34742254
 [6,] 2.941447909 3.19305856 2.84493432  3.06467216 -2.89278864 -3.01397443
 [7,] 0.212443749 0.24785510 0.70188713  0.22946735  0.38614963  0.41709268
 [8,] 1.600379927 3.86758304 0.02352331  2.48789113 -0.19402639 -0.30162620
 [9,] 0.471766840 0.49190010 1.29535569 -0.48172830  0.78173260 -0.79823906
[10,] 0.198614592 0.22353172 1.57205186  0.21070515 -0.55877763 -0.59279292
             [,1]       [,2]       [,3]
 [1,] -0.56047565  1.2240818 -1.0678237
 [2,] -0.23017749  0.3598138 -0.2179749
 [3,]  1.55870831  0.4007715 -1.0260044
 [4,]  0.07050839  0.1106827 -0.7288912
 [5,]  0.12928774 -0.5558411 -0.6250393
 [6,]  1.71506499  1.7869131 -1.6866933
 [7,]  0.46091621  0.4978505  0.8377870
 [8,] -1.26506123 -1.9666172  0.1533731
 [9,] -0.68685285  0.7013559 -1.1381369
[10,] -0.44566197 -0.4727914  1.2538149

You can also save the matrices inside a list:

set.seed(123)
ds <- cbind("a" = rnorm(10), "b" = rnorm(10), "c" = rnorm(10))
d <- 2
out <- numeric()
res <- list()

while (d > 0) {
  out <- cbind(ds[, "a"]^d, ds[, "b"]^d, ds[, "c"]^d)
  if (d > 1) {
    out <- cbind(out, (ds[, "a"] * ds[, "b"])^(d - 1), (ds[, "a"] * ds[, "c"])^(d - 1), (ds[, "b"] * ds[, "c"])^(d - 1))
  }
  res[[d]] <- out
  d <- d - 1
}

Hope this helps!