I have a big matrix (1,000 rows and 50,000 columns). I know some columns are correlated (the rank is only 100) and I suspect some columns are even proportional. How can I find such proportional columns? (one way would be looping corr(M(:,j),M(:,k))
), but is there anything more efficient?

- 1,879
- 2
- 16
- 26
-
What do you mean by **even** proportional? Do you mean that one column is a scalar multiple of another? Like `column 3 = a*(column 1)`? – rayryeng Sep 25 '14 at 15:46
-
The question is ill-posed. Any column can be 'correlated' to any other one, meaning they are not orthogonal. Therefore it is not clear what does it mean 'finding such columns'. If you mean 'finding 100 orthogonal vectors that span the entire space generated by the column of the matrix' QR factorisation is the correct way as per the answer below. – Emerald Weapon Sep 25 '14 at 16:30
-
@rayryeng yes this is what I mean – teaLeef Sep 25 '14 at 17:23
-
@claudv I am looking for columns proportional to each other – teaLeef Sep 25 '14 at 17:24
-
@teaLeef - Then my answer should suffice. Take a look and let me know! – rayryeng Sep 25 '14 at 17:24
-
Can you afford to overwrite your matrix or define another one witth the same size? – Luis Mendo Sep 25 '14 at 18:01
-
@LuisMendo Yes I can write another – teaLeef Sep 25 '14 at 19:16
3 Answers
If I am understanding your problem correctly, you wish to determine those columns in your matrix that are linearly dependent, which means that one column is proportional or a scalar multiple of another. There's a very basic algorithm based on QR Decomposition. For QR decomposition, you can take any matrix and decompose it into a product of two matrices: Q
and R
. In other words:
A = Q*R
Q
is an orthogonal matrix with each column as being a unit vector, such that multiplying Q
by its transpose gives you the identity matrix (Q^{T}*Q = I
). R
is a right-triangular or upper-triangular matrix. One very useful theory by Golub and Van Loan in their 1996 book: Matrix Computations is that a matrix is considered full rank if all of the values of diagonal elements of R
are non-zero. Because of the floating point precision on computers, we will have to threshold and check for any values in the diagonal of R
that are greater than this tolerance. If it is, then this corresponding column is an independent column. We can simply find the absolute value of all of the diagonals, then check to see if they're greater than some tolerance.
We can slightly modify this so that we would search for values that are less than the tolerance which would mean that the column is not independent. The way you would call up the QR factorization is in this way:
[Q,R] = qr(A, 0);
Q
and R
are what I just talked about, and you specify the matrix A
as input. The second parameter 0
stands for producing an economy-size version of Q
and R
, where if this matrix was rectangular (like in your case), this would return a square matrix where the dimensions are the largest of the two sizes. In other words, if I had a matrix like 5 x 8
, producing an economy-size matrix will give you an output of 5 x 8
, where as not specifying the 0
will give you an 8 x 8
matrix.
Now, what we actually need is this style of invocation:
[Q,R,E] = qr(A, 0);
In this case, E
would be a permutation vector, such that:
A(:,E) = Q*R;
The reason why this is useful is because it orders the columns of Q
and R
in such a way that the first column of the re-ordered version is the most probable column that is independent, followed by those columns in decreasing order of "strength". As such, E
would tell you how likely each column is linearly independent and those "strengths" are in decreasing order. This "strength" is exactly captured in the diagonals of R
corresponding to this re-ordering. In fact, the strength is proportional to this first element. What you should do is check to see what diagonals of R
in the re-arranged version are greater than this first coefficient scaled by the tolerance and you use these to determine which of the corresponding columns are linearly independent.
However, I'm going to flip this around and determine the point in the R
diagonals where the last possible independent columns are located. Anything after this point would be considered linearly dependent. This is essentially the same as checking to see if any diagonals are less than the threshold, but we are using the re-ordering of the matrix to our advantage.
In any case, putting what I have mentioned in code, this is what you should do, assuming your matrix is stored in A
:
%// Step #1 - Define tolerance
tol = 1e-10;
%// Step #2 - Do QR Factorization
[Q, R, E] = qr(A,0);
diag_R = abs(diag(R)); %// Extract diagonals of R
%// Step #3 -
%// Find the LAST column in the re-arranged result that
%// satisfies the linearly independent property
r = find(diag_R >= tol*diag_R(1), 1, 'last');
%// Step #4
%// Anything after r means that the columns are
%// linearly dependent, so let's output those columns to the
%// user
idx = sort(E(r+1:end));
Note that E
will be a permutation vector, and I'm assuming you want these to be sorted so that's why we sort them after the point where the vectors fail to become linearly independent anymore. Let's test out this theory. Suppose I have this matrix:
A =
1 1 2 0
2 2 4 9
3 3 6 7
4 4 8 3
You can see that the first two columns are the same, and the third column is a multiple of the first or second. You would just have to multiply either one by 2 to get the result. If we run through the above code, this is what I get:
idx =
1 2
If you also take a look at E
, this is what I get:
E =
4 3 2 1
This means that column 4 was the "best" linearly independent column, followed by column 3. Because we returned [1,2]
as the linearly dependent columns, this means that columns 1 and 2 that both have [1,2,3,4]
as their columns are a scalar multiple of some other column. In this case, this would be column 3 as columns 1 and 2 are half of column 3.
Hope this helps!
Alternative Method
If you don't want to do any QR
factorization, then I can suggest reducing your matrix into its row-reduced Echelon form, and you can determine the basis vectors that make up the column space of your matrix A
. Essentially, the column space gives you the minimum set of columns that can generate all possible linear combinations of output vectors if you were to apply this matrix using matrix-vector multiplication. You can determine which columns form the column space by using the rref
command. You would provide a second output to rref
that gives you a vector of elements that tell you which columns are linearly independent or form a basis of the column space for that matrix. As such:
[B,RB] = rref(A);
RB
would give you the locations of which columns for the column space and B
would be the row-reduced echelon form of the matrix A
. Because you want to find those columns that are linearly dependent, you would want to return a set of elements that don't contain these locations. As such, define a linearly increasing vector from 1
to as many columns as you have, then use RB
to remove these entries in this vector and the result would be those linearly dependent columns you are seeking. In other words:
[B,RB] = rref(A);
idx = 1 : size(A,2);
idx(RB) = [];
By using the above code, this is what we get:
idx =
2 3
Bear in mind that we identified columns 2 and 3 to be linearly dependent, which makes sense as both are multiples of column 1. The identification of which columns are linearly dependent are different in comparison to the QR factorization method, as QR orders the columns based on how likely that particular column is linearly independent. Because columns 1 to 3 are related to each other, it shouldn't matter which column you return. One of these forms the basis of the other two.
I haven't tested the efficiency of using rref
in comparison to the QR method. I suspect that rref
does Gaussian row eliminations, where the complexity is worse compared to doing QR factorization as that algorithm is highly optimized and efficient. Because your matrix is rather large, I would stick to the QR method, but use rref
anyway and see what you get!

- 102,964
- 22
- 184
- 193
-
Great answer. I chose the other though cause I guess it might be better for the future readers. – teaLeef Sep 25 '14 at 19:17
-
@teaLeef - That's totally fine. When it comes to answers, I always get beat out by Luis Mendo lol. Good luck! – rayryeng Sep 25 '14 at 19:19
-
+1 for all the linear algebra :-) I see you have considered the problem of finding columns that are are in the column space spanned by others, right? It's a different problem than I solved (finding columns that are proportional to some other column) – Luis Mendo Sep 25 '14 at 20:09
-
@LuisMendo - Yes :) That's essentially what I was trying to solve. The OP computed the rank of the matrix, so I thought this was the objective the OP wanted - to identify those columns that don't belong in the column space of the matrix. We are fundamentally solving two different problems, but the either method will be useful to the OP as they essentially want to identify those columns that contribute to a non-full rank matrix. – rayryeng Sep 25 '14 at 20:12
-
@rayryeng I have one question then: is the answer to your problem unique? For example: consider `A = [1 2 3; 2 5 7]`. The third column is the sum of the others. So _any_ column is a linear combination of the others. Which column woluld you declare as linearly dependent? – Luis Mendo Sep 25 '14 at 20:23
-
@LuisMendo - No the answer wouldn't be unique as you have correctly noticed. Depending on which algorithm you ran, it will give you two different columns. By running through QR, we get the second column as being linearly dependent while RREF would give the third column. If we were to eliminate one of these columns, then this matrix would be linearly independent, but as soon as you add it back, this matrix becomes rank deficient. I should probably amend to my description where it locates those columns in the matrix that **are making the matrix rank deficient**. – rayryeng Sep 25 '14 at 20:29
-
@rayryeng I am curious for the case of Gram matrix G of A(nxm) with m>>n. Because of dimensionality G is easy to work with. Now rank(A)=rank(G), so does QR based method on G will give same result i.e., is indices found from G, also linearly independent indices of A. – Astro May 18 '15 at 06:08
-
@VinayakAbrol - I'm sorry to say that I'm outside of my element when it comes to what you are talking about. The best thing I can suggest is to try out my method on your data and see what you get. Good luck! – rayryeng May 18 '15 at 06:13
-
@rayryeng I tried the QR method on a real data matrix A (200x1000). Rank(A)=200, that means there are 200 linearly independent columns. Now i took 200 columns in matrix B from A, using first 200 indices of permutation matrix E via QR decomposition. But i found that det(B)=0, i.e., the extracted columns are not linearly independent. Is there anything wrong here? – Astro May 19 '15 at 05:37
If you normalize each column by dividing by its maximum, proportionality becomes equality. This makes the problem easier.
Now, to test for equality you can use a single (outer) loop over columns; the inner loop is easily vectorized with bsxfun
. For greater speed, compare each column only with the columns to its right.
Also to save some time, the result matrix is preallocated to an approximate size (you should set that). If the approximate size is wrong, the only penalty will be a little slower speed, but the code works.
As usual, tests for equality between floating-point values should include a tolerance.
The result is given as a 2-column matrix (S
), where each row contains the indices of two rows that are proportional.
A = [1 5 2 6 3 1
2 5 4 7 6 1
3 5 6 8 9 1]; %// example data matrix
tol = 1e-6; %// relative tolerance
A = bsxfun(@rdivide, A, max(A,[],1)); %// normalize A
C = size(A,2);
S = NaN(round(C^1.5),2); %// preallocate result to *approximate* size
used = 0; %// number of rows of S already used
for c = 1:C
ind = c+find(all(abs(bsxfun(@rdivide, A(:,c), A(:,c+1:end))-1)<tol));
u = numel(ind); %// number of columns proportional to column c
S(used+1:used+u,1) = c; %// fill in result
S(used+1:used+u,2) = ind; %// fill in result
used = used + u; %// update number of results
end
S = S(1:used,:); %// remove unused rows of S
In this example, the result is
S =
1 3
1 5
2 6
3 5
meaning column 1 is proportional to column 3; column 1 is proportional to column 5 etc.

- 110,752
- 13
- 76
- 147
-
+1 - Cool! Never considered doing it this way. I immediately went to linear algebra route. My method is probably a bit too complicated for the average reader. – rayryeng Sep 25 '14 at 18:40
If the determinant of a matrix is zero, then the columns are proportional.
There are 50,000 Columns, or 2 times 25,000 Columns.
It is easiest to solve the determinant of a 2 by 2 matrix.
Hence: to Find the proportional matrix, the longest time solution is to define the big-matrix on a spreadsheet.
Apply the determinant formula to a square beginning from 1st square on the left. Copy it for every row & column to arrive at the answer in the next spreadsheet.
Find the Columns with Determinants Zero.
This is Quite basic,not very time consuming and should be result oriented. Manual or Excel SpreadSheet(Efficient)

- 9
- 3
-
I don't seem to follow this solution. Also, the question is about MATLAB, not Excel. – rayryeng Mar 17 '18 at 22:22
-
If you are a programmer, you may be right..,but if you already know the definition of a proportional matrix.,you should enter the entire matrix,use the Excel Spread-Sheet Formula For the determinant of (any square of chosen pattern within entered rows & columns of elements) matrix- arrive at a solution as per your skill-set., – Sam Mar 19 '18 at 22:59