I'm afraid I'm missing something obvious, but I just can't see what I am doing wrong.
If anyone can help me find it, please, it would be great.
Here's the full, symmetrical distance matrix I'm starting from:
d2 <- structure(list(P1 = c(0, 0.1, 0.3, 0.2, 0, 0.1), P2 = c(0.1,
0, 0.5, 0.7, 1, 0.9), P3 = c(0.3, 0.5, 0, 1, 0.2, 0.3), P4 = c(0.2,
0.7, 1, 0, 0.2, 0.5), P5 = c(0, 1, 0.2, 0.2, 0, 0.7), P6 = c(0.1,
0.9, 0.3, 0.5, 0.7, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
sum(abs(d2-t(d2)))
#[1] 0
I want to generate coordinates for the corresponding 6 points, so that the (euclidean) distance matrix resulting from those coordinates is as close as possible to my d2
.
From the cmdscale
documentation:
A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions.
I would have thought (n-1)/2
dimensions would suffice, and indeed, when I run cmdscale
, if I go anywhere higher than k=3
I get something close to 0 for the higher coordinates, or even error messages:
cmdscale(d2,k=3)
# [,1] [,2] [,3]
#[1,] -0.03526127 0.07755701 1.708755e-05
#[2,] -0.50626939 0.31256816 -5.646907e-02
#[3,] -0.26333957 -0.40518119 -6.978213e-02
#[4,] 0.35902238 0.37455879 2.148406e-02
#[5,] 0.33997864 -0.17998635 -2.809260e-01
#[6,] 0.10586921 -0.17951643 3.856760e-01
cmdscale(d2,k=4)
# [,1] [,2] [,3] [,4]
#[1,] -0.03526127 0.07755701 1.708755e-05 -7.450581e-09
#[2,] -0.50626939 0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,] 0.35902238 0.37455879 2.148406e-02 -7.450581e-09
#[5,] 0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,] 0.10586921 -0.17951643 3.856760e-01 -7.450581e-09
cmdscale(d2,k=5)
# [,1] [,2] [,3] [,4]
#[1,] -0.03526127 0.07755701 1.708755e-05 -7.450581e-09
#[2,] -0.50626939 0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,] 0.35902238 0.37455879 2.148406e-02 -7.450581e-09
#[5,] 0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,] 0.10586921 -0.17951643 3.856760e-01 -7.450581e-09
#Warning message:
#In cmdscale(d2, k = 5) : only 4 of the first 5 eigenvalues are > 0
So, assuming that k=3
is sufficient, this is what happens when I try to reverse the operation:
dd <- dist(cmdscale(d2,k=3),diag = T,upper = T)
dd
# 1 2 3 4 5 6
#1 0.0000000 0.5294049 0.5384495 0.4940956 0.5348482 0.4844970
#2 0.5294049 0.0000000 0.7578630 0.8710048 1.0045529 0.9013064
#3 0.5384495 0.7578630 0.0000000 1.0018275 0.6777074 0.6282371
#4 0.4940956 0.8710048 1.0018275 0.0000000 0.6319294 0.7097335
#5 0.5348482 1.0045529 0.6777074 0.6319294 0.0000000 0.7065166
#6 0.4844970 0.9013064 0.6282371 0.7097335 0.7065166 0.0000000
Which is quite different from what I expected:
as.matrix(dd)-d2
# P1 P2 P3 P4 P5 P6
#1 0.0000000 0.429404930 0.238449457 0.294095619 0.534848178 0.384497043
#2 0.4294049 0.000000000 0.257862963 0.171004810 0.004552925 0.001306386
#3 0.2384495 0.257862963 0.000000000 0.001827507 0.477707386 0.328237091
#4 0.2940956 0.171004810 0.001827507 0.000000000 0.431929428 0.209733518
#5 0.5348482 0.004552925 0.477707386 0.431929428 0.000000000 0.006516573
#6 0.3844970 0.001306386 0.328237091 0.209733518 0.006516573 0.000000000
sum(abs(as.matrix(dd)-d2))
#[1] 7.543948
Has anyone got any idea why the two distance matrices don't match at all?
I could try building my own least squares problem to find the coordinates, but first I need to understand if I'm doing something wrong with these out of the box R functionalities.
Thanks!
EDIT possible inconsistency in the data found
Could the issue be that according to d2
points 1 and 5 coincide (they have distance 0):
as.matrix(d2)
# P1 P2 P3 P4 P5 P6
#[1,] 0.0 0.1 0.3 0.2 0.0 0.1
#[2,] 0.1 0.0 0.5 0.7 1.0 0.9
#[3,] 0.3 0.5 0.0 1.0 0.2 0.3
#[4,] 0.2 0.7 1.0 0.0 0.2 0.5
#[5,] 0.0 1.0 0.2 0.2 0.0 0.7
#[6,] 0.1 0.9 0.3 0.5 0.7 0.0
but then these two points have different distances from other points, e.g. d(1-2) is 0.1 whereas d(5-2) is 1?
Replacing the two 0's does not seem to help though:
d3 <- d2
d3[1,5] <- 0.2
d3[5,1] <- 0.2
dd3 <- cmdscale(as.matrix(d3),k=3)
sum(abs(as.matrix(dist(dd3))-as.matrix(d3)))
#[1] 7.168348
Does this perhaps indicate that not all distance matrices can be reduced to a completely consistent set of points, regardless of how many dimensions one uses?
EDIT 2 possible answer to the last question.
I suspect that the answer is yes. And I was wrong on the number of dimensions, I see now why you need N-1
rather than half that.
If I have a distance d(A-B) = 1, I can represent that in 2-1 = 1 dimensions (x axis), i.e. on a line, placing A in (xA=0) and B in (xB=1).
Then I introduce a third point C and I state that d(A-C) = 2.
I have 3 points, so I need 3-1 = 2 dimensions (xy plane).
The constraint given by d(A-C) is:
(xC - 0)^2 + (yC - 0)^2 = d(A-C)^2 = 4.
i.e. C can be anywhere on a circumference of radius 2 centred in A.
This constrains both xC and yC to be in [-2,2].
However, previously I had not considered that this constrains the possible values of d(B-C), too, because:
d(B-C)^2 = (xC - 1)^2 + (yC - 0)^2
thus, by substitution of the (yC - 0)^2 term:
d(B-C)^2 = (xC - 1)^2 + 4 - (xC - 0)^2 = -2*xC + 5
d(B-C)^2 is therefore bound to [-2*(+2)+5,-2*(-2)+5] = [1,9].
So if my distance matrix contained d(A-B) = 1, d(A-C) = 2 and d(B-C) anywhere outside [1,3], it would configure a system that does not correspond to 3 points in Euclidean space.
At least, I hope this makes sense.
So I guess my original question must be revoked.
I thought I'd leave the reasoning here for future reference or if anyone else should have the same doubt.