2

I have been using the same variations of a pheatmap code to make heat maps for several months now without any problems, but lately it has stopped being able to cluster rows. Columns still cluster like normal but whenever I try to add row clustering it gives me the same error message about NA/NaN/Inf in the data

All of my datasets look very similar, with primarily just the number of rows changing (between 40-2000+). Here is a head of data I'm currently using, with all 0s already replaced with NA:

> head(protdata, 4)
          PR1      PO1      WA1     PR2      PO2      WA2      PR3      PO3     WA3      PR4 PO4     WA4      PR5      PO5
[1,] 0.004420       NA 0.002370 0.00141 0.002890 0.003740 4.36e-03 0.005370 0.00143 0.002070  NA 0.00428 0.005220       NA
[2,] 0.000233 8.85e-06 0.000136      NA 0.000056 0.000713 5.98e-05       NA      NA 0.000541  NA      NA 0.006700 4.95e-05
[3,] 0.001220 1.79e-05 0.000447 0.00183 0.000136       NA 6.99e-04 0.000298 0.00267 0.001330  NA      NA 0.000655 1.36e-04
[4,] 0.001170 6.84e-04 0.000282 0.00173 0.001620 0.000648 1.05e-03 0.003570 0.00101 0.001410  NA      NA 0.002960       NA
          WA5     PR6      PO6      WA6      PR7      PO7      WA7
[1,] 0.001030 0.00448       NA 1.53e-03 0.005220 0.005520 1.86e-03
[2,] 0.000139 0.00145 0.000484 8.88e-05 0.000118 0.000122 1.79e-05
[3,] 0.003680 0.00033       NA       NA       NA 0.000163 3.99e-03
[4,] 0.000393 0.00023       NA       NA 0.000625       NA 7.15e-04

There are a lot of 0s in the datasets, but clustering has always worked as long as they are converted to NA. None of the columns or rows are zero variance. Here is the code I've been using to make the heat maps:

protdata <- as.matrix(input[,-1])
protdata[protdata == 0] <- NA

rownames <- input[,1]
annotation_row <- data.frame(rownames)
rownames(protdata) <- annotation_row$Gene

pheatmap(log10(protdata), scale="row", border_color=NA, na_col="white", breaks=seq(-2,2,.01),
     color=colorRampPalette(rev(brewer.pal(n=7, name="RdYlBu")))(400))

And here is the error message I keep getting:

Error in hclust(d, method = method) : 
  NA/NaN/Inf in foreign function call (arg 10)

The only way I can get a plot to appear is with cluster_rows=FALSE included in the above. I am stumped as to why this was working perfectly and now isn't, when as far as I know I haven't changed anything with the way I'm inputting my data.

Any help would be greatly appreciated!!

paige
  • 23
  • 1
  • 3
  • it will not work once between 2 rows, there is an absence of complete observations – StupidWolf Apr 28 '20 at 08:50
  • like this pheatmap(matrix(c(NA,1,NA,2,3,NA),ncol=3)) – StupidWolf Apr 28 '20 at 08:51
  • @StupidWolf I don't understand? I have used this exact code with several other datasets filled with NAs and have never had a problem with missing data. All of my rows have >2 samples with z scores and none of them are zero variance – paige Apr 28 '20 at 19:58
  • did you check the example i provided? you can see it's impossible to calculate a distance. checking > 2 samples etc does not guarantee this – StupidWolf Apr 28 '20 at 19:59
  • @StupidWolf Yes but in the example there is not a sufficient number of values to calculate distances, whereas in mine shouldn't there be plenty? – paige Apr 28 '20 at 20:17
  • this is more representative and it works? `pheatmap(matrix(c(NA,1,2,4,NA,2,3,1,NA),ncol=3))` – paige Apr 28 '20 at 20:18
  • as long as there's one pair where you cannot calculate, it throws an error. In your code, you did nothing to check against what I have just said. thats my point. ok do this, ```table(is.na(dist(protdata))))``` – StupidWolf Apr 28 '20 at 20:27
  • hi @paige, if you can provide a link to the data. I can check it for you, which rows are giving the problem. this way it saves the miscommunication – StupidWolf Apr 28 '20 at 20:56
  • @StupidWolf sorry I'm so confused about this! here is a link to one set of data https://github.com/bosschard/peet_protdata – paige Apr 28 '20 at 22:29
  • No worries. thanks for sharing the data. I took a look. There are like about 60 rows which are giving problems for pairwise euclidean distances... we can remove them first so you can plot – StupidWolf Apr 28 '20 at 22:42
  • Ok I did a quick calculation.. removing 14 of them will work, i can write it below as an answer and you see whether it means sense – StupidWolf Apr 28 '20 at 22:47

2 Answers2

1

I converted your file into a csv and read in:

mat = read.csv("peet_protdata.csv",row.names=1)
mat[mat==0] = NA

No rows with all NAs or zero variance like you said, but if you do the dist calculation, there are NAs in some of the entries, indicating between some rows, it's not possible to calculate euclidean distances. You need to the euclidean distance matrix to have no NAs to do clustering:

 sum(is.na(as.matrix(dist(mat))))
[1] 434

Below is a quick (nasty) bit to find the rows giving the most NAs, remove them to give you a complete distance matrix:

giveNAs = which(is.na(as.matrix(dist(mat))),arr.ind=TRUE)
head(giveNAs)
    row col
G103  18   1
G100  53   1

So for example, row 18 and row 1 is giving you problems, and you can see there are no complete observations (pairwise):

mat[c(1,18),]
     PR1 PO1 WA1         PR2 PO2 WA2         PR3 PO3 WA3         PR4
G56   NA  NA  NA 0.000483209  NA  NA 0.000433088  NA  NA 0.000203604
G103  NA  NA  NA          NA  NA  NA          NA  NA  NA          NA
             PO4         WA4         PR5        PO5 WA5 PR6 PO6 WA6 PR7
G56  0.000294898 0.000269724 0.000299341 0.00046987  NA  NA  NA  NA  NA
G103          NA          NA          NA         NA  NA  NA  NA  NA  NA
             PO7         WA7         PR8 PO8 WA8         PR9         PO9 WA9
G56  0.000682594 0.000656168 0.000702988  NA  NA          NA          NA  NA
G103          NA          NA          NA  NA  NA 0.000629987 0.000504159  NA

We get the rows out and start checking what to remove:

tab = sort(table(c(giveNAs)),decreasing=TRUE)
checkNA = sapply(1:length(tab),function(i){
sum(is.na(as.matrix(dist(mat[-as.numeric(names(tab[1:i])),]))))
})
rmv = names(tab)[1:min(which(checkNA==0))]

 [1] "18"  "53"  "81"  "84"  "54"  "97"  "55"  "38"  "70"  "100" "31"  "93" 
[13] "52"  "80"  "91"

We remove those 15 rows:

mat = mat[-as.numeric(rmv),]
pheatmap(mat)

enter image description here

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • Awesome, I just tried this on a couple of my other datasets that also hadn't been working and they all are now! Thank you for taking the time to help with this! – paige Apr 29 '20 at 00:21
  • cool. no problem.. yeah sorry it was a bit hard to explain the problem in words. glad you could move over the frustration of plotting :) – StupidWolf Apr 29 '20 at 00:23
0

You will need to remove your NAs before calling pheatmap (by changing to zeros here)

mat[is.na(mat)] = 0

pheatmap(mat)
  • This does not work with all the zeros and is not reflective of the data (zero isn't actually zero, just too low to detect), which is why I had to convert them to NAs in the first place. – paige Nov 03 '20 at 00:00