Sequence Combination Matrix in R

Question

I'm looking to create a matrix for 5 variables, such that each variable takes a value from seq(from = 0, to = 1, length.out = 500) and rowSums(data) = 1 .

In other words, I am wondering how to create a matrix that shows all the possible combinations of numbers with the sum of every row = 1.

Hi Cezar, I'm not sure your question is clear enough. Do you want to populate a matrix with 5 columns using ony 0s and 1s so that each row has sum 1? That would mean you want the Identity matrix. Is that correct? — Diego Rodrigues, Jun 12 '18 at 17:07
Not quite. I am trying to assign 5 vectors values from 0 to 1 (in this example, 500 rational numbers from 0 to 1), and identify all the possible combinations of vectors that have the sum 1. In other words, I would have a matrix with 5 columns and less than 500^5 rows (as not all the rational numbers in this case will sum 1). Hope this makes sense. — Cezar Visan, Jun 12 '18 at 17:35
yes! so it could be 0 0 0 0 1, as well as 0 0 0 1 0. Basically the only valid combinations are the ones with the "weight" = 1. — Cezar Visan, Jun 12 '18 at 17:42
To avoid precision problems, I'd suggest working with integers. Look for all combinations of 5 positive integers that sum to 500. You can divide any answer through by 500 to convert back to your original problem. — Gregor Thomas, Jun 12 '18 at 17:55
This is a big matrix that cannot be done in R.. just an example `a=expand.grid(rep(list(seq(0,1,length.out = 20)),5));`nrow(a[rowSums(a)==1,])` gives 8385.. Showing that from a length of 20. we have a matrix with rows 8385 made that satisfies the conditionbut this is obtained from a matrix of ` nrow(a)` which is `3,200,000` rows.. Since the number of rows increase exponentially, it is impossible to have a sequence of 500. But you can use for-loops and try your luck — Onyambu, Jun 12 '18 at 17:58
The number of rows is the number of integer compositions of 500 into 5 parts. [Using this formula from wikipedia](https://en.wikipedia.org/wiki/Composition_(combinatorics)#Number_of_compositions), that is `choose(499, 4) = 2552446876` rows (about 2.5 Billion) rows. How much memory and time do you have? What is the purpose/what is the actual problem you are trying to solve here? — Gregor Thomas, Jun 12 '18 at 18:12
Though if you are willing to let the order matter, that cuts the number down by a factor of about 120, to a somewhat more reasonable 21 million row result. — Gregor Thomas, Jun 12 '18 at 18:18
Thank you all for responses, seems like it is impossible to go at it as I originally intended. @Gregor my idea was to create a portfolio allocation for 5 assets. For this, I would need to look at the possible combination of weights, summing up to 1. It might be unreasonable to use 500 weights, but it worked fine for 3 assets and I *stupidly* assumed it would work for 5. — Cezar Visan, Jun 12 '18 at 18:32
@Onyambu, I just saw your comment. This can be done in R using the right tools. And quite quickly I might add. Using package partitions, it is possible. See my answer. — Joseph Wood, Jun 17 '18 at 21:59
@Gregor, the formula you are referencing is for the number of **_compositions_** where order matters... With partitions, order does not matter, which is what we are looking for here. Now, obtaining this number is quite tedious, but the result for this question is much more manageable... ~22 million. Continuing.... — Joseph Wood, Jun 17 '18 at 22:54
@Gregor, .... Here is a [link](https://github.com/randy3k/arrangements/blob/490cc5f9da20ebb2f13d18f60c92cc230167870d/src/k_partitions.c#L299) to Randy Lai's GitHub for the package `arrangements` that gives an algorithm for finding the number of restricted partitions that don't include zero. It isn't exactly the same as the algo in the package `partitions` as it has an option for including zero, however `arrangements::partitions(499, 5)` returns `21960586`, so we miss out on a few partitions.... This should still give you a good idea of how to do it for the general case. — Joseph Wood, Jun 17 '18 at 22:55
@JosephWood I took OP's earlier comment (comment #4 in this ridiculous comment thread) that `0 0 0 0 1` as well as `0 0 0 1 0` would both be counted in the result to mean that **order does matter**. Perhaps I misunderstood, but yes, I recommended compositions not partitions fully aware of the difference between the two. Order also matters in the accepted answer (see, e.g., rows 1 and 6 in Luis's answer). But perhaps OP can clarify that I am incorrect. — Gregor Thomas, Jun 18 '18 at 01:18

score 1 · Answer 1 · answered Jun 12 '18 at 17:46

If I understood correctly, this could take you to the right track at least.

# Parameters
len_vec = 500 # vector length
num_col = 5 # number of columns

# Creating the values for the matrix using rational numbers between 0 and 1
values <- runif(len_vec*num_col)

# Creating matrix
mat <- matrix(values,ncol = num_col,byrow = T)

# ROunding the matrix to create only 0s and 1s
mat <- round(mat)

# Calculating the sum per row
apply(mat,1,sum)

score 1 · Accepted Answer · answered Jun 12 '18 at 18:36

Here is an iterative solution, using loops. Gives you all possible permutations of numbers adding up to 1, with the distance between them being a multiple of N. The idea here is to take all numbers from 0 to 1 (with distance of a multiple of N between them), then for each one include in a new column all the numbers that when added don't go above 1. Rinse and repeat, except in the last iteration, in which you only add the numbers that complete the row the sum of the row.

Like people pointed out in the comments, if you want N = 1/499*, it will give you a really really big matrix. I noticed that for N = 1/200 it was already taking around 2, 3 minutes, so it would probably take way too long for N = 1/499.

*seq(from = 0, to = 1, length.out = 500) is the same as seq(from = 0, to = 1, by = 1/499)

N = 1/2
M = 5

x1 = seq(0, 1, by = N)

df = data.frame(x1)

for(i in 1:(M-2)){

  x_next = sapply(rowSums(df), function(x){seq(0, 1-x, by = N)})
  df = data.frame(sapply(df, rep, sapply(x_next,length)))
  df = cbind(df, unlist(x_next))

}

x_next = sapply(rowSums(df), function(x){1-x})
df = sapply(df, rep, sapply(x_next,length))
df = data.frame(df)
df = cbind(df, unlist(x_next))

> df
    x1 unlist.x_next. unlist.x_next..1 unlist.x_next..2 unlist(x_next)
1  0.0            0.0              0.0              0.0            1.0
2  0.0            0.0              0.0              0.5            0.5
3  0.0            0.0              0.0              1.0            0.0
4  0.0            0.0              0.5              0.0            0.5
5  0.0            0.0              0.5              0.5            0.0
6  0.0            0.0              1.0              0.0            0.0
7  0.0            0.5              0.0              0.0            0.5
8  0.0            0.5              0.0              0.5            0.0
9  0.0            0.5              0.5              0.0            0.0
10 0.0            1.0              0.0              0.0            0.0
11 0.5            0.0              0.0              0.0            0.5
12 0.5            0.0              0.0              0.5            0.0
13 0.5            0.0              0.5              0.0            0.0
14 0.5            0.5              0.0              0.0            0.0
15 1.0            0.0              0.0              0.0            0.0

this is a great approach! I would have never thought of going at it with loops! Thank you so much! — Cezar Visan, Jun 12 '18 at 18:53

score 1 · Answer 3 · answered Jun 12 '18 at 20:18

This is exactly what the package partitions is made for. Basically the OP is looking for all possible combinations of 5 integers that sum to 499. This can easily be achieved with restrictedparts:

system.time(combsOne <- t(as.matrix(restrictedparts(499, 5))) / 499)
 user  system elapsed 
1.635   0.867   2.502 


head(combsOne)
         [,1]        [,2] [,3] [,4] [,5]
[1,] 1.000000 0.000000000    0    0    0
[2,] 0.997996 0.002004008    0    0    0
[3,] 0.995992 0.004008016    0    0    0
[4,] 0.993988 0.006012024    0    0    0
[5,] 0.991984 0.008016032    0    0    0
[6,] 0.989980 0.010020040    0    0    0

tail(combsOne)
                 [,1]      [,2]      [,3]      [,4]      [,5]
[22849595,] 0.2024048 0.2004008 0.2004008 0.2004008 0.1963928
[22849596,] 0.2064128 0.1983968 0.1983968 0.1983968 0.1983968
[22849597,] 0.2044088 0.2004008 0.1983968 0.1983968 0.1983968
[22849598,] 0.2024048 0.2024048 0.1983968 0.1983968 0.1983968
[22849599,] 0.2024048 0.2004008 0.2004008 0.1983968 0.1983968
[22849600,] 0.2004008 0.2004008 0.2004008 0.2004008 0.1983968

And since we are dealing with numeric values we can't get exact precision, however we can get machine precision:

all(rowSums(combsOne) == 1)
[1] FALSE

all((rowSums(combsOne) - 1) < .Machine$double.eps)
[1] TRUE

There are over 22 million results:

row(combsOne)
[1] 22849600

Sequence Combination Matrix in R

3 Answers3