Finding the odd point in a dataset without using loops

Question

I am given a set of points (p1,q1), (p2,q2), ..., (p20,q20) that satisfy the function q = 1/(a*p + b)^2, except for one point that does not. The values of a and b are not given to me; all I have are the two input arrays p and q. I need to find the index of the point that does not satisfy the relation.

My approach is to solve for a and b using the first two pairs (p1,q1) and (p2,q2), and then check whether the remaining points satisfy the function for the solved values of a and b. The results are stored in a logical matrix. I want to use the logical matrix to pick out the odd pair, but I am unable to proceed further.

Specifically, the challenge is to use vectorization in MATLAB to find the odd point instead of resorting to for-loops. I think I will first have to search for a row that contains exactly one logical zero; the column index of that zero then gives me the odd point. But if there is more than one zero in all 4 rows, then the odd point is one of the first two pairs. I need help translating this into efficient MATLAB code.

Please note that vectors p and q have been named as x and y in the below code.

function [res, sol] = findThePair(x, y)

N = length(x);

syms a b
vars = [a,b];
eqns = [y(1) - 1/(a*x(1) + b)^2 == 0; y(2) - 1/(a*x(2) + b)^2 == 0];
[solA, solB] = solve(eqns,vars);
sol = [double(solA) double(solB)];    % solutions for a & b (4 possibilities in total)

xTest = x(3:end);   % performing check on remaining points
yTest = y(3:end);
res = zeros(4, N-2);    % logical matrix to store the results of equality check

for i = 1:4
    A = sol(i,1); B = sol(i, 2);
    res(i, :) = [yTest == 1./(A*xTest + B).^2]; % perform equality check on remaining points
end

Out of curiosity - if you plot x vs y, can you spot the outlier? Please explain what's wrong with your code and show us the inputs you're using (the vectors `p` and `q`). See also: [mcve] and [XY problem](http://xyproblem.info/). — Dev-iL, Mar 19 '19 at 14:40

Dev-iL · Answer 1 · 2019-03-19T15:59:04.913

I don't really understand how you were trying to solve this or what syms (i.e. symbolic variables) have to do with it, so I'll show you how I would solve this problem.

Since we're essentially looking for an outlier, we might as well convert the problem to something that's easier to work with. For this reason, instead of using q as-is, I'm going to take its reciprocal: 1/q = (a*p + b)^2, so we'd be dealing with the equation of a parabola, which is easy.

Next, knowing that our points should lie on a parabola, we can fit the equation of the parabola (or equivalently - find the coefficients of the polynomial that describes the relation of the input to the output). The polynomial is a^2*x^2+2*a*b*x+b^2, and so the coefficients are {a^2, 2*a*b, b^2}.

Since the majority of the points (19 out of 20) lie on the same parabola, the outlier will always have a larger error, which would make it stand out, no matter how close it is to the parabola (within the limitations of machine precision) - you can see an extreme example of this in the code below.

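Fitting of the parabola is performed as a least-squares polynomial fit: the backslash operator solves the overdetermined Vandermonde system (see also: polynomial interpolation, Vandermonde matrix).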

function I = q55241683()
%% Generate the ground truth:
TRUE_A = 2.3;
TRUE_B = -pi;
IDX_BAD = 5;

p = 1:0.04:1.76;
q = (TRUE_A * p + TRUE_B).^-2;
q(IDX_BAD) = (1-1E-10)*q(IDX_BAD); % notice just how close this is to being valid

%% Visualize dataset:
% figure(); plot(p,q.^-1);

%% Solve
I = findThePair(p, q.^-1);

%% Test
if IDX_BAD == I
  disp('Great success!');
else
  disp('Complete failure!');
end

end

function I = findThePair(x,y)
% Fit a parabola to {x vs. y}; y is already the reciprocal 1/q (see the call above)
P = x(:).^(2:-1:0)\y(:); % alternatively: P = polyfit(x,y,2)
% Estimate {a,b} (or {-a,-b})
est_A = sqrt(P(1));
est_B = P(2)/(2*est_A);
% Compute the distances of the points from the fit (residuals), find the biggest:
[~,I] = max( abs(y - (est_A*x + est_B).^2) );
end
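
As a sanity check on the index found this way, one could refit without the suspected point and confirm that the remaining residuals drop to round-off level. The following is only a sketch under assumptions: the data values and the variable names (`V`, `keep`, `resKept`, `resSuspect`) are made up for illustration, and `y` is taken to be 1/q as in the code above.

```matlab
% Hypothetical test data (same construction as above; values are assumptions)
p = 1:0.04:1.76;                      % 20 sample points
q = (2.3*p - pi).^-2;                 % points on q = 1/(a*p + b)^2
q(5) = (1 - 1e-6)*q(5);               % perturb one point slightly

x = p(:);  y = 1 ./ q(:);             % work with 1/q, which is quadratic in p
V = x.^(2:-1:0);                      % Vandermonde matrix [x.^2, x, 1]

% First pass: least-squares fit on all points; largest residual is the suspect
P = V \ y;
[~, I] = max(abs(y - V*P));

% Second pass: refit without the suspect and compare residual levels
keep = true(size(x));  keep(I) = false;
P2 = V(keep,:) \ y(keep);
resKept    = max(abs(y(keep) - V(keep,:)*P2));  % expected near round-off
resSuspect = abs(y(I) - V(I,:)*P2);             % expected clearly larger

fprintf('suspect index: %d, residual ratio: %.3g\n', I, resSuspect/resKept);
```

If the ratio is not large, the first pass probably picked the wrong point and the data deserve a closer look.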

score 3 · Answer 2 · answered Mar 19 '19 at 15:47

Let's do some maths up front, to avoid needing loops or vectorisation. At most this leaves us with half a dozen function evaluations, and we only need 5 points.

q = 1 / (a*p + b)^2
% ->
sqrt(q) * ( a*p + b ) = 1
% ->
a = ( 1 - b*sqrt(q) ) / ( p * sqrt(q) )

% Sub in some points (1 and 2) ->
a1 = ( 1 - b*sqrt(q1) ) / ( p1 * sqrt(q1) )    
a2 = ( 1 - b*sqrt(q2) ) / ( p2 * sqrt(q2) )
% a1 and a2 should be the same ->
( 1 - b*sqrt(q1) ) * ( p2 * sqrt(q2) ) = ( 1 - b*sqrt(q2) ) * ( p1 * sqrt(q1) )
% Rearrange ->
b = ( p2*sqrt(q2) - p1*sqrt(q1) ) / ( (p2-p1)*sqrt(q1)*sqrt(q2) )
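
Note that the rearrangement above takes the positive square root, i.e. it assumes a*p + b > 0 at both points used (if it is negative at both, the same formulas return -a and -b, which describe the same curve). A quick numerical check of the closed-form expressions, with arbitrarily chosen values that are assumptions and not part of the original problem:

```matlab
% Arbitrary coefficients for the check (assumed values, with a*p + b > 0)
a = 1.7;  b = 0.4;
p1 = 2;  p2 = 5;
q1 = 1/(a*p1 + b)^2;
q2 = 1/(a*p2 + b)^2;

% Closed-form expressions from the derivation above
bEst = (p2*sqrt(q2) - p1*sqrt(q1)) / ((p2 - p1)*sqrt(q1)*sqrt(q2));
aEst = (1 - bEst*sqrt(q1)) / (p1*sqrt(q1));

% Both errors should be at round-off level
fprintf('a error: %.2e, b error: %.2e\n', abs(a - aEst), abs(b - bEst));
```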

We have two unknowns, a and b, so two points are enough to set up simultaneous equations. I'll use the following logic:

1. Choose (pm, qm) and (pn, qn) with any m ~= n.
2. Calculate a and b using the above equations.
3. Test whether a third point (pr, qr) fits the calculated a and b.
   - If it fits, we know all three of these must be on the curve, and we have a and b.
   - If it doesn't fit, we know one of the points m, n or r is the outlier. Return to step 1 with two other points; the a and b calculated from them must be correct, as we've not fitted to the outlier.

Here is some code to implement this:

% Random coeffs, keep things unknown
a = rand*10;
b = rand*10;
% Set up our data
p = 1:20;
q = 1 ./ (a*p + b).^2;
% Create an outlier
q( 3 ) = q( 3 ) + 1;

% Steps as described 

% 1.
p1 = p(1); p2 = p(2);
q1 = q(1); q2 = q(2);

% 2.
bGuess = ( p2*sqrt(q2) - p1*sqrt(q1) ) / ( (p2-p1)*sqrt(q1)*sqrt(q2) );
aGuess = ( 1 - bGuess*sqrt(q1) ) / ( p1 * sqrt(q1) );

% 3.
p3 = p(3);
q3Guess = 1 / ( aGuess*p3 + bGuess )^2;

tol = 1e-7; % Use tolerance rather than == comparison to avoid float issues

if abs( q3Guess - q(3) ) < tol
    % success
    aFit = aGuess;
    bFit = bGuess;
else
    % p1, p2 or p3 is an outlier! Repeat using other points
    % If there's known to be only one outlier, this should give the result
    p1 = p(4); p2 = p(5);
    q1 = q(4); q2 = q(5);
    bFit = ( p2*sqrt(q2) - p1*sqrt(q1) ) / ( (p2-p1)*sqrt(q1)*sqrt(q2) );
    aFit = ( 1 - bFit*sqrt(q1) ) / ( p1 * sqrt(q1) );    
end

% Validate
fprintf( 'a is valid: %d, b is valid: %d\n', abs(a-aFit)<tol, abs(b-bFit)<tol )
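
The question asks for the index of the odd point. Once aFit and bFit are known, a single vectorised comparison over all points should give it. This is a minimal sketch, assuming exactly one outlier; the coefficient values and the name `idxOutlier` are made up for illustration, and aFit/bFit are recomputed here from two clean points only to keep the snippet self-contained.

```matlab
% Assumed example data: 20 points on the curve, with point 3 corrupted
a = 4.2;  b = 1.5;                     % hypothetical "unknown" coefficients
p = 1:20;
q = 1 ./ (a*p + b).^2;
q(3) = q(3) + 1;

% Stand-ins for the aFit/bFit produced by the steps above
% (computed from points 4 and 5, which are clean in this example)
s = sqrt(q([4 5]));
bFit = (p(5)*s(2) - p(4)*s(1)) / ((p(5) - p(4))*s(1)*s(2));
aFit = (1 - bFit*s(1)) / (p(4)*s(1));

% Vectorised residual over all points; the outlier has the largest one
[~, idxOutlier] = max(abs(q - 1 ./ (aFit*p + bFit).^2));
fprintf('outlier index: %d\n', idxOutlier);
```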

How much of an "outlier" does it have to be, to be detectable? — Dev-iL, Mar 19 '19 at 15:52
@Dev-iL The function evaluation with the first guess has to be more than `tol` from the actual line to be an outlier. In this case, the numerical precision of the `sqrt` operations is likely the limiting factor. I'd think for any reasonable definition of "outlier" this would be pretty robust. — Wolfie, Mar 19 '19 at 15:53
