
I have a set of N points in k dimensions as a matrix of size N X k.

How can I find the best-fitting line through these points? The "line" will actually be a plane (hyperplane) in k dimensions. It will have k coefficients and one bias term.

Existing functions like `fit` seem to be usable only for points in 2 or 3 dimensions.

S_S
  • Are you looking for [Multiple Linear Regression](https://www.mathworks.com/help/stats/regress.html)? – Mateen Ulhaq Jan 11 '18 at 05:48
  • Yes that seems to provide a solution. I only have a set of points in `k` dimensions, I do not have the response variable. So then should I take the last dimension as the response variable? But this would give me a `k-1` dimensional plane. – S_S Jan 11 '18 at 05:55
  • I would probably use [PCA](https://www.mathworks.com/help/stats/examples/fitting-an-orthogonal-regression-using-principal-components-analysis.html) since there's no distinction between predictor and response variables. – jodag Jan 11 '18 at 06:26
  • This should also be solvable as a linear least squares fitting problem, no? MATLAB has the cool `\` operator for that. – Cris Luengo Jan 12 '18 at 03:32
  • I agree @CrisLuengo, but \ also requires providing a response variable, as it is y\x. Here, since I only have a set of points, I would need to decide which to choose as a response variable. Further, if I need a hyperplane with a bias term, I need to augment the dataset with ones to get a `k+1` dimensional weight vector, but this will give the trivial solution with coefficients `w=0` and bias `b=1`. – S_S Jan 12 '18 at 06:46
  • Sure, you need to pick one dimension as "special", if that axis is within the plane it won't do well. The bias term is not an issue. I was just writing this up as an answer. :) – Cris Luengo Jan 12 '18 at 06:53
  • 1
  • Also, the PCA solution will work better if the data doesn't fill `k-1` dimensions, for example if it sits on a line in 3D. – Cris Luengo Jan 12 '18 at 06:56

2 Answers


You can fit a hyperplane (or any lower dimensional affine space) to a set of D-dimensional data using Principal Component Analysis. Here's an example of fitting a plane to a set of 3D data. This is explained in more detail in the MATLAB documentation, but I tried to construct the simplest example I could.

% generate some random correlated data
D = 3;
mu = zeros(1,D);
sqrt_sig = randn(D);
sigma = sqrt_sig'*sqrt_sig;
% generate 50 points in a D x 50 matrix
X = mvnrnd(mu, sigma, 50)';

% perform PCA
coeff = pca(X');

% The last principal component is normal to the best-fit plane,
% and the plane goes through the mean of X
a = coeff(:,D);
b = -mean(X,2)'*a;

% plane defined by a'*x + b = 0
dist = abs(a'*X+b) / norm(a);
mse = mean(dist.^2)
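
Since pca returns an orthonormal coeff matrix, a is already a unit normal, so a'*x + b is the signed distance from x to the plane. As a quick sketch of how you might use the fit (the names x0 and x_proj are illustrative, not part of the original code), you can project any query point onto the fitted plane:

% project a query point onto the fitted plane a'*x + b = 0
% (a'*x0 + b is the signed distance, since a is a unit vector)
x0 = randn(D,1);                 % an arbitrary query point
x_proj = x0 - (a'*x0 + b)*a;     % remove the component along the normal
% a'*x_proj + b is now zero up to round-off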

Edit: Added an example plot of the results for D = 3. I take advantage of the orthogonality of the other principal components here. Ignore the code if you want; it's just to demonstrate that the plane does in fact fit the data pretty well.

% plot in 3D
X0 = bsxfun(@minus,X,mean(X,2));        % center the data
b1 = coeff(:,1);    b2 = coeff(:,2);    % in-plane basis directions
y1 = b1'*X0;        y2 = b2'*X0;        % data coordinates within the plane
y1_min = min(y1);   y1_max = max(y1);
y1_span = y1_max - y1_min;
y2_min = min(y2);   y2_max = max(y2);
y2_span = y2_max - y2_min;
pad = 0.2;
y1_min = y1_min - pad*y1_span;
y1_max = y1_max + pad*y1_span;
y2_min = y2_min - pad*y2_span;
y2_max = y2_max + pad*y2_span;
[y1_m,y2_m] = meshgrid(linspace(y1_min,y1_max,5), linspace(y2_min,y2_max,5));
% map the 2D grid of in-plane coordinates back into 3D
grid = bsxfun(@plus, bsxfun(@times,y1_m(:)',b1) + bsxfun(@times,y2_m(:)',b2), mean(X,2));
x = reshape(grid(1,:),size(y1_m));
y = reshape(grid(2,:),size(y1_m));
z = reshape(grid(3,:),size(y1_m));

figure(1); clf(1);
surf(x,y,z,'FaceColor','black','FaceAlpha',0.3,'EdgeAlpha',0.6);
hold on;
plot3(X(1,:),X(2,:),X(3,:),' .');
axis equal;
axis vis3d;

[Plots: two views of the fitted plane through the 3D data points]

Edit2: When I say "principal component" I'm being a bit sloppy (or just plain wrong) with the wording. I'm actually referring to the orthogonal basis vectors that the principal components are expressed in.
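
As the first paragraph mentions, the same idea extends to fitting any lower dimensional affine subspace: keep the first d principal directions instead of dropping the last one. A minimal sketch for a line in 3D (the names d, m, B, and X_line are illustrative):

% fit a d-dimensional affine subspace (here a line, d = 1)
d = 1;
m = mean(X,2);                     % the fit passes through the mean
B = coeff(:,1:d);                  % orthonormal basis of the fitted subspace
% orthogonal projection of the data onto the fitted line
X_line = bsxfun(@plus, B*(B'*bsxfun(@minus,X,m)), m);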

jodag

Here's a simpler solution that just uses MATLAB's \ operator. We start by defining a plane in k dimensions:

%  0 = a + x(1) * b(1) + x(2) * b(2) + ... + x(k) * 1
k = 8;
a = randn(1);
b = randn(k-1,1);

(Note that we assume b(k)=1; you can always multiply the plane parameters by any nonzero value without changing the plane.)

Next we generate N random points within this plane:

N = 1000;
x = rand(N,k-1);
x(:,k) = -(a + x * b);

...sorry, it's not the best way to generate random points on the plane, but it's good enough for the demonstration here (a cleaner alternative is sketched below). Add noise to the points:

x = x + 0.05*randn(size(x));
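
As an aside, here is a sketch of one cleaner way to sample the plane exactly, using an orthonormal basis for its direction space (the names c, x0, B, and x_alt are illustrative, and this step is not required for the rest of the answer):

% sample N points exactly on the plane a + x*[b;1] = 0
c = [b;1];                          % (unnormalized) plane normal
x0 = -a*c/(c'*c);                   % point on the plane closest to the origin
B = null(c');                       % k x (k-1) orthonormal in-plane basis
x_alt = bsxfun(@plus, randn(N,k-1)*B', x0');   % N x k points on the plane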

To find the parameters of the plane, we solve the system of equations

%    a + x(1:k-1) * b == -x(k)

in the least-squares sense. a and b are the unknowns there. We can rewrite the left-hand side as [1,x(1:k-1)] * [a;b]. If we have a matrix equation M*p=v we can solve for p by writing p=M\v:

p = [ones(N,1),x(:,1:k-1)]\(-x(:,k));

disp(['ground truth: [a,b,1] = ',mat2str([a,b',1],3)]);
disp(['estimated   : [a,b,1] = ',mat2str([p',1],3)]);

This gives as output:

ground truth: [a,b,1] = [-1.35 -1.44 -1.48 1.17 0.226 -0.214 0.234 -1.59 1]
estimated   : [a,b,1] = [-1.41 -1.38 -1.43 1.14 0.219 -0.195 0.221 -1.54 1]

The less noise, or the more points in the dataset, the smaller the error will be, of course!
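
If you want to quantify this, you can look at the residuals of the solved system directly; a short sketch (res and rmse are illustrative names):

% residuals of the least-squares fit; their RMS shrinks with less
% noise or more points
res = [ones(N,1), x(:,1:k-1)]*p + x(:,k);
rmse = sqrt(mean(res.^2))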

Cris Luengo