K-means clustering in R studio with 30+ microbiome samples

Question

I have 33 samples containing microbiome community data.

I have used the full_join command to combine all 33 samples into one dataframe. The dataframe looks like this:

          Samp1 Samp2 Samp3 Samp4
species1   0.1   8      0     2
species2   9     0.02   0     1
species3   0.3    1     1     0.1
species4    5     3    0.4    2

It is very large: 33 columns and 54,454 rows. I need to build a distance matrix to see how similar the columns (samples) are to one another based on their abundance values, which come from the species observed in each sample. From there (if I ever get there) I want to do k-means clustering. Is this possible in R? I have tried the following (with the corresponding error messages):

fviz_nbclust(samples, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2):
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In stats::dist(x) : NAs introduced by coercion
2: In storage.mode(x) <- "double" : NAs introduced by coercion

as well as:

mat <- dist(samples, method = "euclidean"):  
Error in do.call(".External", c(list(CFUN, x, y, pairwise, if (!is.function(method)) get(method) else method),  : 
  negative length vectors are not allowed

dput(samples[1:20, 1:5]):
structure(list(species = c("A0A1I7T9A8_9PELO", "A0A7J7AV85_9COLE", 
"A0A653T3J9_9MICO", "A0A6B0USQ1_IXORI", "W1W9S2_9STAP", "A0A653THV2_9MICO", 
"A0A0J7YLY8_BETVV", "A0A077ZKY0_TRITR", "A7A2I7_BIFAD", "A0A2C8AE20_9ACTN", 
"V8M0T5_STRTR", "A0A1B2YXC3_9BACT", "A0A2L2YUX1_PARTP", "A0A0K9Q7A6_SPIOL", 
"A0A0W0XZ17_9GAMM", "I0S7Z1_STRAP", "A0A1I7SXW9_9PELO", "A0A6A5L026_LUPAL", 
"A0A2Z5TND2_9STRE", "T1DQS5_ANOAQ"), mvh1 = c(27.76699, 9.61795, 
5.04776, 2.81076, 2.73102, 2.34273, 2.21013, 1.46822, 1.22727, 
1.13887, 0.90139, 0.83551, 0.82425, 0.74018, 0.67257, 0.6093, 
0.57897, 0.51136, 0.5001, 0.49229), mvh2 = c(NA, 0.00531, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA), mvh3 = c(NA, NA, NA, NA, NA, 0.96, 2.88, NA, NA, NA, NA, 
NA, NA, NA, NA, 1.92, NA, NA, NA, NA), mvh4 = c(0.02, 0.02285, 
NA, NA, NA, NA, 0.01286, 2.4369, NA, 0.23426, 5.17091, 0.04857, 
0.02571, NA, NA, 0.05142, 0.00571, 0.27283, 1.43414, 0.09428)), row.names = c(NA, 
20L), class = "data.frame")

Also when trying dist() on samptest, a subset of the same dataset, I do not get a matrix but I just get this:

 1            2            3            4            5            6
                7            8            9           10           11           12
               13           14           15           16           17           18
               19           20           21           22           23           24
               25           26           27           28           29           30
               31           32           33           34           35           36
               37           38           39           40           41           42
               43           44           45           46           47           48
               49           50           51           52           53           54
               55           56           57           58           59           60
               61           62           63           64           65           66
               67           68           69           70           71           72
               73           74           75           76           77           78
               79           80           81           82           83           84
               85           86           87           88           89           90
               91           92           93           94           95           96
               97           98           99          100          101          102
              103          104          105          106          107          108
              109          110          111          112          113          114
              115          116          117          118          119          120

So it doesn't seem to be calculating relationships between the columns.

Thanks

Can you edit the question with the output of `dput(samptest[1:20, 1:5])`? And which is your data set with 33 columns and 54k rows, `samples` or `samptest`? — Rui Barradas, Apr 20 '23 at 18:31
Sorry, samptest was a smaller subset I was testing out. The large dataset with 54k rows is samples. I added the output of dput. — Alex Gomez, Apr 20 '23 at 19:15
The first column is not numeric, try `dist(samples[-1], method = "euclidean")` to remove it from the calculation. — Rui Barradas, Apr 20 '23 at 19:18
Thanks, I got the same error code: Error in do.call(".Call", c(list(method), list(x), list(y), pairwise, : negative length vectors are not allowed — Alex Gomez, Apr 20 '23 at 19:21
See [this SO post](https://stackoverflow.com/questions/36469671/error-in-do-onenmeth-na-nan-inf-in-foreign-function-call-arg-1), your data has lots of `NA`'s. — Rui Barradas, Apr 20 '23 at 21:20
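
Putting the comments above together, here is a minimal sketch of one way to get a sample-by-sample distance matrix and a k-means fit. It rests on assumptions not confirmed in the question: that the `NA` values left by `full_join` mean "species not observed in this sample" and can be set to 0, and that Euclidean distance on the raw abundances is acceptable. The toy data frame and the names `abund`, `sample_dist` and `km` are made up for illustration. Note that `dist()` and `kmeans()` both work on rows, so the matrix is transposed to put samples in rows; without the transpose, `dist()` would try to compare all 54,454 species pairwise, which is probably related to the "negative length vectors" error.

```r
# Toy stand-in for `samples`: first column holds species IDs,
# the remaining columns hold per-sample abundances (hypothetical values).
samples <- data.frame(
  species = c("sp1", "sp2", "sp3", "sp4", "sp5"),
  mvh1 = c(27.8, 9.6, 5.0, NA, 2.7),
  mvh2 = c(NA, 0.005, NA, 1.2, NA),
  mvh3 = c(25.1, 8.9, NA, NA, 3.0),
  mvh4 = c(0.02, 0.02, NA, 2.4, NA)
)

# Drop the non-numeric species column and convert to a numeric matrix.
abund <- as.matrix(samples[-1])
rownames(abund) <- samples$species

# Assumption: an NA created by full_join means "not observed", so use 0.
abund[is.na(abund)] <- 0

# dist() compares rows, so transpose to put the samples in rows.
sample_dist <- dist(t(abund), method = "euclidean")
print(sample_dist)

# kmeans() also clusters rows, so it gets the transposed matrix too.
# centers = 2 is an arbitrary choice for this toy data.
set.seed(1)
km <- kmeans(t(abund), centers = 2, nstart = 10)
print(km$cluster)
```

On the real data the same transposed matrix could presumably be passed to `fviz_nbclust(t(abund), kmeans, method = "wss")` to choose the number of clusters. Whether Euclidean distance is the right measure for abundance data is a separate question that this sketch does not address.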
