9

I have a data set in which some elements are NA. I want to cluster it without removing the rows that contain NA.

I understand that the Gower distance measure in daisy allows for this. Why doesn't my code below work? I welcome alternatives to daisy.

# Plot the heat map and dendrogram together.

library("gplots")
library("cluster")


# Arbitrarily assigning NA to some elements
mtcars[2,2] <- "NA"
mtcars[6,7]  <- "NA"

 mydata <- mtcars

hclustfunc <- function(x) hclust(x, method="complete")

# Initially I wanted to use this but it didn't take NA
#distfunc <- function(x) dist(x,method="euclidean")

# Try daisy with the Gower metric,
# which is supposed to work with NA values
distfunc <- function(x) daisy(x,metric="gower")

d <- distfunc(mydata)
fit <- hclustfunc(d)

# Perform clustering heatmap
heatmap.2(as.matrix(mydata), dendrogram="row", trace="none", margins=c(8,9), hclustfun=hclustfunc, distfun=distfunc)

The error message I got is this:

    Error in which(is.na) : argument to 'which' is not logical
    Calls: distfunc.g -> daisy
    In addition: Warning messages:
    1: In data.matrix(x) : NAs introduced by coercion
    2: In data.matrix(x) : NAs introduced by coercion
    3: In daisy(x, metric = "gower") :
      binary variable(s) 8, 9 treated as interval scaled
    Execution halted

Ultimately, I'd like to perform hierarchical clustering on data that contains NA values.

Update

Converting with as.numeric works for the example above. But why does this code fail when the data is read from a text file?

library("gplots")
library("cluster")

# This time read from file
mtcars <- read.table("http://dpaste.com/1496666/plain/",na.strings="NA",sep="\t")

# Following the suggestion, convert to numeric
mydata <- apply( mtcars, 2, as.numeric )

hclustfunc <- function(x) hclust(x, method="complete")
#distfunc <- function(x) dist(x,method="euclidean")
# Try daisy with the Gower metric
distfunc <- function(x) daisy(x,metric="gower")

d <- distfunc(mydata)
fit <- hclustfunc(d)

heatmap.2(as.matrix(mydata), dendrogram="row", trace="none", margins=c(8,9), hclustfun=hclustfunc, distfun=distfunc)

The error I get is this:

    Warning messages:
    1: In min(x) : no non-missing arguments to min; returning Inf
    2: In max(x) : no non-missing arguments to max; returning -Inf
    3: In min(x) : no non-missing arguments to min; returning Inf
    4: In max(x) : no non-missing arguments to max; returning -Inf
    Error in hclust(x, method = "complete") : 
      NA/NaN/Inf in foreign function call (arg 11)
    Calls: hclustfunc -> hclust
    Execution halted


neversaint
  • 2
    `"NA"` isn't the same as `NA`. But other than that how would you suggest to define the distance between two points when NA is one of the values? – Dason Dec 08 '13 at 04:36
  • 1
    In my understanding `daisy` takes care of that: http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html – neversaint Dec 08 '13 at 04:47
  • 1
    I don't get it, how did you solve this problem? I'm running into the same error message and I'm not able to find any site that explains what to do. I don't want to simply remove the NA values, I want them in my heatmap as "missing" or something similar. Please post the answer if you figured it out. Thanks. – AHegde Feb 06 '16 at 00:15

2 Answers

5

The error is due to the presence of non-numeric variables in the data (numbers encoded as strings). You can convert them to numbers:

mydata <- apply( mtcars, 2, as.numeric )
d <- distfunc(mydata)
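
As a minimal sketch of the difference between the quoted and unquoted forms, the snippet below uses the built-in mtcars data; the names quoted, fixed, and unquoted are only illustrative. It assumes the cluster package is installed (it is a recommended package bundled with most R installations). The conversion here uses lapply instead of apply so that the result stays a data frame and keeps its row names; that is a variation on the conversion above, not a requirement.

```r
library(cluster)  # provides daisy()

# Work on copies so the built-in mtcars data set stays untouched
quoted <- mtcars
quoted[2, 2] <- "NA"            # a string: the whole cyl column becomes character
col_class <- class(quoted$cyl)  # "character"

# Convert every column back to numeric while keeping the data frame and its
# row names; the string "NA" becomes a real missing value (coercion warning
# suppressed here on purpose)
fixed <- quoted
fixed[] <- suppressWarnings(lapply(fixed, as.numeric))

# Assigning NA without quotes avoids the problem in the first place
unquoted <- mtcars
unquoted[2, 2] <- NA

# Gower dissimilarities leave the missing entry out of the comparisons
# that involve row 2, so the row can stay in the data
d <- daisy(fixed, metric = "gower")
fit <- hclust(d, method = "complete")
```

With a real NA in place, daisy still warns that the binary columns are treated as interval scaled, but it returns a complete dissimilarity object that hclust accepts.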
Vincent Zoonekynd
  • 5
    @neversaint, when you assign NA values to a numeric data.frame, do not use quotes. This caused your problem. Quotes are used to delimit character constants. If the intention is to change to numeric matrix, the presence of character values in your data will coerce the rest of the values in your matrix into character. – TWL Dec 07 '13 at 16:25
  • 1
    In your update, the file is not tab-delimited: you end up with only one column, and since its contents (whole rows) cannot be converted to numbers, everything is replaced by `NA`. – Vincent Zoonekynd Dec 08 '13 at 07:13
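
The effect described in the last comment can be reproduced without the remote file. The sketch below uses a small, made-up whitespace-delimited table (the contents of txt are hypothetical, not the actual file) and reads it once with sep="\t" and once with the default separator.

```r
# Hypothetical whitespace-delimited content standing in for the remote file
txt <- "mpg cyl disp
Mazda_RX4 21.0 6 160
Datsun_710 22.8 NA 108
Valiant 18.1 6 225"

# With sep="\t" there is no tab to split on, so each line is one field
bad <- read.table(text = txt, sep = "\t", na.strings = "NA")
n_bad_cols <- ncol(bad)  # 1

# Whole lines cannot be parsed as numbers, so everything turns into NA
bad_numeric <- suppressWarnings(apply(bad, 2, as.numeric))

# The default separator (any whitespace) splits the fields as intended;
# the first column becomes the row names because the header is one field shorter
good <- read.table(text = txt, header = TRUE, na.strings = "NA")
```

Checking ncol() or str() right after read.table is a quick way to catch a wrong separator before the all-NA matrix reaches daisy and hclust.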
3

Using as.numeric may help in this case, but I do think that the original question points to a bug in the daisy function. Specifically, it has the following code:

    if (any(ina <- is.na(type3))) 
    stop(gettextf("invalid type %s for column numbers %s", 
        type2[ina], pColl(which(is.na))))

The intended error message is not printed, because which(is.na) is wrong. It should be which(ina).
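
The difference can be seen in isolation with a small sketch; the vector type3 below is a made-up stand-in for the internal variable, not the values daisy actually computes.

```r
# Hypothetical stand-in for daisy's internal type3 vector
type3 <- c(1, NA, 3)
ina <- is.na(type3)

# which(is.na) passes the function is.na itself, which is not a logical
# vector, so which() stops with its own error
msg <- tryCatch(which(is.na), error = function(e) conditionMessage(e))

# which(ina) returns the positions of the missing entries, as intended
bad_cols <- which(ina)  # 2
```

Here msg holds the same "argument to 'which' is not logical" text that appears in the question, which is why the more helpful "invalid type" message never showed up.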

I guess I should find out where / how to submit this bug now.

rakensi
  • 1
    Indeed, thank you @rakensi, also for reporting the typo/thinko which led to a "not very helpful" error message instead of a helpful one. As you know, I have already fixed the code in the development version of the `cluster` package (http://svn.r-project.org/R-packages/trunk/cluster/R/daisy.q). – Martin Mächler Jun 17 '15 at 19:37