clustering with NA values in R

Question

I was surprised to find that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles them.

So my questions are:

How does clara handle NAs?
Can the same approach somehow be used for kmeans (which does not allow NAs)?

[Update] I found these lines of code in the clara function:

    inax <- is.na(x)
    valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
    x[inax] <- valmisdat

which replace missing values with valmisdat. I'm not sure I understand the reason for this formula. Any ideas? Wouldn't it be more "natural" to treat the NAs in each column separately, perhaps replacing them with the column mean or median?
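As the comments on the accepted answer explain, valmisdat is a marker for missing data that is handed to the compiled code, not an imputed value. The following is a minimal sketch of why the formula can serve as such a marker. The matrix `x` is made-up data, and the check assumes the data are not all zero:

```r
# Toy data (hypothetical), with two missing entries
x <- matrix(c(1, -4, NA, 2.5, 3, NA), nrow = 3)

inax <- is.na(x)
valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))  # 1.1 * 4

x_coded <- x
x_coded[inax] <- valmisdat

# The marker is larger in magnitude than every observed value,
# so it cannot collide with real data
observed <- x[!inax]
all(abs(observed) < valmisdat)

# The missing cells can be recovered exactly from the coded matrix
identical(which(x_coded == valmisdat), which(inax))
```

Because the marker only flags cells, its size has no effect on the distances as long as flagged cells are skipped rather than used.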

Gavin Simpson · Accepted Answer · 2012-05-24T08:40:04.270

9

Although it is not stated explicitly, I believe that NAs are handled in the manner described on the ?daisy help page. The Details section has:

In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.

Given that the same code is used internally by clara(), this is how I understand NAs in the data to be handled: they simply don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is, for example, used in the definition of Gower's generalised similarity coefficient.

Update: The C sources for clara() clearly indicate that this is how NAs are handled (lines 350-356 in ./src/clara.c):

    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }
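The skip logic in the C excerpt can be sketched in plain R. This is an illustration only, not the package's implementation: `na_skip_dist` is a hypothetical helper, it tests for `NA` directly instead of using the marker value, and it leaves out any rescaling for the number of skipped variables.

```r
# Hypothetical helper: Euclidean-type distance between two rows that
# skips every variable where either row is missing
na_skip_dist <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)       # variables observed in both rows
  if (!any(ok)) return(NA_real_)    # no shared variables: undefined
  sqrt(sum((a[ok] - b[ok])^2))
}

a <- c(1, NA, 3, 4)
b <- c(2, 5, NA, 8)
na_skip_dist(a, b)   # uses variables 1 and 4 only: sqrt(1 + 16)
```

Returning `NA` when two rows share no observed variable makes that case visible instead of silently reporting a distance of zero.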

edited May 24 '12 at 08:40

answered May 23 '12 at 14:19


Same code lines to treat missing values in `daisy` as in `clara` function (see my question update). – danas.zuokas May 24 '12 at 07:14
@danas.zuokas I'm not sure how helpful it is to just pull arbitrary lines of code from the sources that you think are related to the question. You need to study both the R code and the C code. `valmisdat` is the value used to indicate missing data (`NA`) in the C code, rather than having it use `NA` directly. If you look at the C code you will see that it clearly just ignores comparisons where a variable has a missing value for one or the other or both of the samples for which the dissimilarity is being computed. See the updated answer for the pointer to the code. – Gavin Simpson May 24 '12 at 08:36
Can you think of ways to employ same NA handling in `kmeans`? – danas.zuokas May 24 '12 at 09:23
3

Possibly, but not without writing your own k-means algorithm. Essentially k-means works on the within-group sums of squares, i.e. distances to the centroid. `clara` is doing the same thing, so the idea is feasible (you just ignore those comparisons when computing the Euclidean distance to the centroid, and when computing the centroid itself, I guess). Are you fixed on using k-means? If k-medoids is OK (and I don't see why it wouldn't be, as it is more robust than k-means), use the `pam()` function in the **cluster** package, which handles `NA`s like `clara()` and `daisy()`. – Gavin Simpson May 24 '12 at 11:12
@GavinSimpson did you have scalability issues using a matrix-oriented approach before k-means? Depending on memory allocation in R, the vector may be too large. – Scott Davis Jun 13 '14 at 18:00
@ScottDavis not really - I have 64Gb RAM :-) and no problems have exhausted that yet. I see your point, but you need the entire dissimilarity object to do k-means in R so this is an issue regardless of `NA`s, no? Or am I not following? – Gavin Simpson Jun 13 '14 at 18:08
@GavinSimpson Yes, I agree you need the whole dataset for k-means. Just wanted to bring up a problem that might happen. :) I had a dataset where the daisy package wouldn't load because I did not have enough RAM. – Scott Davis Jun 13 '14 at 18:23
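For the `pam()` suggestion in the comments above, a minimal usage sketch might look like this. The data are made up, and the example assumes the **cluster** package is installed. The missing values are placed so that every pair of rows still shares at least one observed variable, since otherwise a dissimilarity cannot be computed for that pair.

```r
library(cluster)

set.seed(1)
# Two well-separated toy groups (made-up data), three variables
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 10), ncol = 3))
x[3, 1]  <- NA   # knock out a few values
x[15, 2] <- NA

fit <- pam(x, k = 2)
fit$clustering   # one cluster label per row, including the rows with NA
```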

score 3 · Answer 2 · edited Mar 05 '14 at 08:59

Not sure if kmeans can handle missing data by ignoring the missing values in a row.

There are two steps in kmeans:

1. Calculating the distance between each observation and the current cluster means.
2. Updating the cluster means based on the newly calculated distances.

When we have missing data in our observations, Step 1 can be handled by adjusting the distance metric appropriately, as clara, pam and daisy in the cluster package do. But Step 2 can only be performed if we have some value for each column of an observation. Therefore imputing might be the next best option for kmeans to deal with missing data.
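A minimal sketch of the imputation route, using the column median as the question suggests. The data and the helper `impute_median` are hypothetical, and this is one simple choice of imputation, not a recommendation:

```r
set.seed(42)
# Two toy groups (made-up data) with a couple of missing values
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 10), ncol = 3))
x[2, 1]  <- NA
x[17, 3] <- NA

# Hypothetical helper: replace each NA with the median of its own column
impute_median <- function(m) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    m[miss, j] <- median(m[, j], na.rm = TRUE)
  }
  m
}

x_imp <- impute_median(x)
fit <- kmeans(x_imp, centers = 2, nstart = 10)
table(fit$cluster)   # cluster sizes
```

Column-wise imputation ignores the cluster structure: with two groups, the column median lies between them, so an imputed value can pull a row away from its own group.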

Ben · Answer 3 · 2016-03-10T23:08:09.403

0

Looking at the clara C code, I noticed that when observations have missing values, the sum of squares is "reduced" in proportion to the number of missing values, which I think is wrong. Line 646 of clara.c reads " dsum *= (nobs / pp) ": it counts the number of variables that are non-missing in both observations of a pair (nobs), divides it by the total number of variables (pp), and multiplies the sum of squares by that ratio. I think it should be done the other way round, i.e. " dsum *= (pp / nobs) ".
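The difference between the two scalings can be seen with a small worked example. This only illustrates the arithmetic described above on made-up numbers, reusing the names `pp`, `nobs` and `dsum` from the answer. It does not reproduce clara.c.

```r
# Made-up pair of observations with pp = 4 variables
a <- c(1, NA, 3, 5)
b <- c(3, 2, NA, 7)

pp   <- length(a)
ok   <- !is.na(a) & !is.na(b)
nobs <- sum(ok)                   # 2 variables observed in both rows
dsum <- sum((a[ok] - b[ok])^2)    # (1 - 3)^2 + (5 - 7)^2 = 8

dsum * (nobs / pp)   # 4: shrinks the partial sum even further
dsum * (pp / nobs)   # 16: scales the partial sum up to all pp variables
```

With `pp / nobs`, a pair with missing values is treated as if the skipped variables contributed the same average squared difference as the observed ones. With `nobs / pp`, the same pair ends up looking closer than its observed variables suggest.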

edited Mar 10 '16 at 23:08

answered Mar 06 '16 at 23:21


