I have this binary matrix:

a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
da=matrix(cbind(a0,a,b,c0,c1,c2,c3),nrow=40,ncol=7)

I need to split this matrix into two subsets (matrices) that have the same number of columns but different numbers of rows (say 85% vs. 15%), and the split has to be done so that neither subset has collinearity.

Here is the problem I have. When I subset da using

ind <- sample(1:nrow(da), trunc(85*nrow(da)/100)) 
trda <- da[ind,] 
teda <- da[-ind,]

one of the two subsets turns out not to be full rank.
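A minimal sketch of how the rank of each piece can be checked with `qr()`. It assumes "full rank" here means full column rank (as a regression design matrix would need). The helper name `has_full_column_rank` and the seed are hypothetical additions for illustration only.

```r
# Rebuild the example matrix from above
a0 <- rep(1, 40)
a  <- rep(0:1, 20)
b  <- c(rep(1, 20), rep(0, 20))
c0 <- c(rep(0, 12), rep(1, 28))
c1 <- c(rep(1, 5),  rep(0, 35))
c2 <- c(rep(1, 8),  rep(0, 32))
c3 <- c(rep(1, 23), rep(0, 17))
da <- cbind(a0, a, b, c0, c1, c2, c3)

# Hypothetical helper: TRUE when the columns of m are linearly independent
has_full_column_rank <- function(m) qr(m)$rank == ncol(m)

set.seed(1)  # arbitrary seed, only so the draw is repeatable
ind  <- sample(1:nrow(da), trunc(85 * nrow(da) / 100))
trda <- da[ind, , drop = FALSE]
teda <- da[-ind, , drop = FALSE]

has_full_column_rank(da)    # the full 40 x 7 matrix
has_full_column_rank(trda)  # the 85% piece, depends on the draw
has_full_column_rank(teda)  # the 15% piece
```

In this toy example the 15% piece has only 6 rows for 7 columns, so its rank can be at most 6 whichever rows are drawn.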

Can someone explain how I can subset the matrix without getting collinearity? This is just an example; I am dealing with a big matrix.

Thanks

Falcon-StatGuy
  • Given that a singular/non-singular matrix has to be square, I don't see how you can split an 80000x900 matrix into two squares... – Spacedman Sep 07 '12 at 14:43
  • I didn't know that because you didn't say anything about cross products. As it stands it sounds like you want to split a big matrix into two smaller matrices. What do you mean by 'split'? To me, it means cut the matrix along a row or column into two pieces. It doesn't mean take a subset of rows or columns which might be non-contiguous. You really need to edit your question, and maybe give us an example (with perhaps a 12x5 matrix as an example) – Spacedman Sep 07 '12 at 17:18

1 Answer

Since you only have zeroes and ones in your rows, collinear rows are identical rows.

Build a string for each row by pasting its columns together:

> das = apply(da,1,paste,collapse="")
> das
 [1] "1010111" "1110111" "1010111" "1110111" "1010111" "1110011" "1010011"
 [8] "1110011" "1010001" "1110001" "1010001" "1110001" "1011001" "1111001"
[15] "1011001" "1111001" "1011001" "1111001" "1011001" "1111001" "1001001"
[22] "1101001" "1001001" "1101000" "1001000" "1101000" "1001000" "1101000"
[29] "1001000" "1101000" "1001000" "1101000" "1001000" "1101000" "1001000"
[36] "1101000" "1001000" "1101000" "1001000" "1101000"

A quick test for whether the split can be done at all is whether any string appears more than twice:

> any(table(das)>2)
[1] TRUE

If a string appears more than twice, then at least one of your two matrices must contain the same row at least twice. You have eight 1001000 rows here, for example.
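To see which row patterns get in the way, the counts can be listed directly. A small sketch, rebuilding `da` from the question so it runs on its own; the name `tab` is just for illustration.

```r
# Example matrix from the question
da <- cbind(a0 = rep(1, 40),
            a  = rep(0:1, 20),
            b  = c(rep(1, 20), rep(0, 20)),
            c0 = c(rep(0, 12), rep(1, 28)),
            c1 = c(rep(1, 5),  rep(0, 35)),
            c2 = c(rep(1, 8),  rep(0, 32)),
            c3 = c(rep(1, 23), rep(0, 17)))

# One string per row, then count how often each pattern occurs
das <- apply(da, 1, paste, collapse = "")
tab <- table(das)

tab[tab > 2]   # the patterns that occur more than twice
```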

To do the actual split, when it is possible, take each row that appears twice and put one copy in each matrix, then divide up the remaining rows by any method you like.
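A rough sketch of that split, assuming the only goal is that neither part contains the same row twice. The function name `split_no_duplicates`, its `frac` argument, and the small matrix `m` are all made up for illustration; this is not a rank check in itself.

```r
# Hypothetical helper: split the rows of m into two matrices so that
# neither contains the same row twice. Returns NULL when some row pattern
# occurs more than twice, because then no such split exists.
split_no_duplicates <- function(m, frac = 0.85) {
  key <- apply(m, 1, paste, collapse = "")
  if (any(table(key) > 2)) return(NULL)

  second <- duplicated(key)        # second copy of each repeated row
  paired <- key %in% key[second]   # both copies of each repeated row
  in_a   <- paired & !second       # first copies go to part a
  free   <- which(!paired)         # rows that occur once, either side is fine

  # Fill part a up towards the target size with randomly chosen single rows
  target <- trunc(frac * nrow(m))
  n_more <- min(length(free), max(0, target - sum(in_a)))
  if (n_more > 0) {
    in_a[free[sample.int(length(free), n_more)]] <- TRUE
  }

  list(a = m[in_a, , drop = FALSE], b = m[!in_a, , drop = FALSE])
}

# Toy input (made up): one row appears twice, the others once
m <- rbind(c(1, 0, 1), c(1, 0, 1), c(1, 1, 0),
           c(0, 1, 1), c(1, 1, 1), c(0, 0, 1))
set.seed(42)  # arbitrary seed
parts <- split_no_duplicates(m, frac = 0.5)
parts$a
parts$b
```

The requested fraction is only a target: every repeated row forces one copy into each part, so the sizes are not always reachable. Having no repeated rows does not guarantee full column rank either, so it is still worth checking `qr(parts$a)$rank` and `qr(parts$b)$rank` afterwards.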

Are we on the right lines here?

Spacedman