
I have a function that calculates simple matching distances in a matrix with ordinal data:

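library(proxy)
m <- test  # `test` is the data frame defined below
# Share of positions at which rows x and y hold the same value
f <- function(x, y) sum(x == y) / NROW(x)
matches <- as.matrix(dist(m, f, upper = TRUE))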

The problem is that this function returns NA for every pair of rows in which either row has a missing value, as in the following matrix.

test <- structure(list(X1 = c(1, 2, 3, 4, 2, NA), X2 = c(2, 3, 4, 5, 
3, 6), X3 = c(3, 4, NA, 5, 3, 2), X4 = c(2, 4, 6, 5, 3, 8), X5 = c(1, 
3, 2, 4, 6, 4)), .Names = c("X1", "X2", "X3", "X4", "X5"), row.names = c(NA, 
6L), class = "data.frame")

The resulting distance matrix for this will be:

> matches
    1   2  3  4   5  6
1 0.0 0.0 NA  0 0.2 NA
2 0.0 0.0 NA  0 0.4 NA
3  NA  NA  0 NA  NA NA
4 0.0 0.0 NA  0 0.0 NA
5 0.2 0.4 NA  0 0.0 NA
6  NA  NA NA NA  NA  0

How can I adapt this function to calculate matching distances even when there are missing values?

Werner Hertzog

2 Answers


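Use mean() with na.rm = TRUE, so that positions where either value is missing are dropped before the proportion of matches is computed: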

f <- function(x,y) mean(x == y, na.rm = TRUE)

as.matrix(dist(m, f, upper=TRUE))
#     1   2 3    4   5    6
# 1 0.0 0.0 0 0.00 0.2 0.00
# 2 0.0 0.0 0 0.00 0.4 0.00
# 3 0.0 0.0 0 0.00 0.0 0.00
# 4 0.0 0.0 0 0.00 0.0 0.25
# 5 0.2 0.4 0 0.00 0.0 0.00
# 6 0.0 0.0 0 0.25 0.0 0.00
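To see where the 0.25 for rows 4 and 6 comes from, here is a minimal base R sketch that compares those two rows by hand. The vectors are copied from the question's data; the variable names (row4, row6, cmp and the three results) are only illustrative.

```r
# Rows 4 and 6 of the question's `test` data frame, written out as plain vectors.
row4 <- c(4, 5, 5, 5, 4)
row6 <- c(NA, 6, 2, 8, 4)

# Element-wise comparison: any position involving NA yields NA.
cmp <- row4 == row6                      # NA FALSE FALSE FALSE TRUE

# Original formula: a single NA makes the whole sum NA.
res_naive <- sum(cmp) / length(cmp)

# mean(..., na.rm = TRUE): 1 match out of the 4 comparable positions.
res_mean <- mean(cmp, na.rm = TRUE)

# Alternative denominator: 1 match out of all 5 positions.
res_all <- sum(cmp, na.rm = TRUE) / length(cmp)

print(c(naive = res_naive, mean_na_rm = res_mean, all_positions = res_all))
```

Note the design choice: mean(x == y, na.rm = TRUE) divides by the number of positions where both values are present, whereas the last variant keeps the full row length as the denominator. Which one is appropriate depends on how missing values should be weighted.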

Also be aware that numeric (double) vectors are subject to floating-point error, so == will not always return what you expect. This won't be a problem if you store your data as a matrix of integers.
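A short base R sketch of the floating-point caveat, followed by one possible way to convert a data frame of whole-number codes to an integer matrix. The small data frame df is hypothetical and stands in for the real data.

```r
# Two doubles that look equal but differ in their binary representation.
a <- 0.1 + 0.2
b <- 0.3
exact_equal <- a == b                   # FALSE
near_equal  <- isTRUE(all.equal(a, b))  # TRUE, compares within a tolerance

# Hypothetical data frame of ordinal codes stored as doubles.
df <- data.frame(X1 = c(1, 2), X2 = c(3, NA))

# Convert to a matrix and switch the storage mode to integer;
# NA stays NA, and == on integers is exact.
m_int <- as.matrix(df)
storage.mode(m_int) <- "integer"

print(exact_equal)
print(near_equal)
print(is.integer(m_int))
```

This only matters when the values were produced by arithmetic; whole numbers typed in directly, as in the question's data, compare exactly even when stored as doubles.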

flodel

I'm not sure I fully understand your question, but it seems as though you want to treat NA not as a missing value but as another 'category'. In that case, you could convert the columns of your data.frame to character and paste an arbitrary character in front of every value, so that NA becomes an ordinary string that can be compared. For example,

for (i in seq_along(test)) test[[i]] <- paste0("*", as.character(test[[i]]))

Then

require(proxy)
m <- test
f <- function(x,y) sum(x == y) / length(x)
matches <- as.matrix(dist(m, f, upper=TRUE))

    1   2 3   4   5   6
1 0.0 0.0 0 0.0 0.2 0.0
2 0.0 0.0 0 0.0 0.4 0.0
3 0.0 0.0 0 0.0 0.0 0.0
4 0.0 0.0 0 0.0 0.0 0.2
5 0.2 0.4 0 0.0 0.0 0.0
6 0.0 0.0 0 0.2 0.0 0.0

Note that I changed NROW(x) to length(x).
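The effect of the paste0() trick can be checked on two short vectors. This is a minimal base R sketch with made-up values (x, y and the derived names are illustrative only): after the conversion, NA becomes the string "*NA", so two missing values in the same position count as a match.

```r
# Two illustrative vectors with a missing value in the same position.
x <- c(3, NA, 2)
y <- c(3, NA, 5)

# Plain comparison: the NA position yields NA.
as_missing <- x == y                    # TRUE NA FALSE

# After prefixing, NA turns into the ordinary string "*NA".
xs <- paste0("*", as.character(x))      # "*3" "*NA" "*2"
ys <- paste0("*", as.character(y))      # "*3" "*NA" "*5"

# Comparison of the strings: NA against NA is now a match.
as_category <- xs == ys                 # TRUE TRUE FALSE

print(as_missing)
print(as_category)
print(sum(as_category) / length(as_category))
```

A consequence worth keeping in mind: an NA compared with a real value counts as a mismatch under this scheme, which is why rows 4 and 6 score 0.2 here rather than 0.25.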

Carson