I have 33 samples containing microbiome community data.
I used full_join to combine all 33 samples into one data frame, which looks like this:
          Samp1  Samp2  Samp3  Samp4
species1  0.1    8      0      2
species2  9      0.02   0      1
species3  0.3    1      1      0.1
species4  5      3      0.4    2
It is very large: 33 columns and 54,454 rows. I need a distance matrix that shows how similar the columns (samples) are to each other, based on the abundance values in each column, which come from the species observed in each sample. From there (if I ever get there) I want to do k-means clustering. Is this possible in R? I have tried the following (with the corresponding error messages):
fviz_nbclust(samples, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2):
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In stats::dist(x) : NAs introduced by coercion
2: In storage.mode(x) <- "double" : NAs introduced by coercion
as well as:
mat <- dist(samples, method = "euclidean"):
Error in do.call(".External", c(list(CFUN, x, y, pairwise, if (!is.function(method)) get(method) else method), :
negative length vectors are not allowed
dput(samples[1:20, 1:5]):
structure(list(species = c("A0A1I7T9A8_9PELO", "A0A7J7AV85_9COLE",
"A0A653T3J9_9MICO", "A0A6B0USQ1_IXORI", "W1W9S2_9STAP", "A0A653THV2_9MICO",
"A0A0J7YLY8_BETVV", "A0A077ZKY0_TRITR", "A7A2I7_BIFAD", "A0A2C8AE20_9ACTN",
"V8M0T5_STRTR", "A0A1B2YXC3_9BACT", "A0A2L2YUX1_PARTP", "A0A0K9Q7A6_SPIOL",
"A0A0W0XZ17_9GAMM", "I0S7Z1_STRAP", "A0A1I7SXW9_9PELO", "A0A6A5L026_LUPAL",
"A0A2Z5TND2_9STRE", "T1DQS5_ANOAQ"), mvh1 = c(27.76699, 9.61795,
5.04776, 2.81076, 2.73102, 2.34273, 2.21013, 1.46822, 1.22727,
1.13887, 0.90139, 0.83551, 0.82425, 0.74018, 0.67257, 0.6093,
0.57897, 0.51136, 0.5001, 0.49229), mvh2 = c(NA, 0.00531, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), mvh3 = c(NA, NA, NA, NA, NA, 0.96, 2.88, NA, NA, NA, NA,
NA, NA, NA, NA, 1.92, NA, NA, NA, NA), mvh4 = c(0.02, 0.02285,
NA, NA, NA, NA, 0.01286, 2.4369, NA, 0.23426, 5.17091, 0.04857,
0.02571, NA, NA, 0.05142, 0.00571, 0.27283, 1.43414, 0.09428)), row.names = c(NA,
20L), class = "data.frame")
Also, when I run dist() on samptest, a subset of the same dataset, I do not get a distance matrix; I just get this:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36
37 38 39 40 41 42
43 44 45 46 47 48
49 50 51 52 53 54
55 56 57 58 59 60
61 62 63 64 65 66
67 68 69 70 71 72
73 74 75 76 77 78
79 80 81 82 83 84
85 86 87 88 89 90
91 92 93 94 95 96
97 98 99 100 101 102
103 104 105 106 107 108
109 110 111 112 113 114
115 116 117 118 119 120
So it doesn't seem to be calculating relationships between the columns.
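To make the goal concrete, here is a minimal sketch of the kind of result I am after, on a made-up toy data frame (the name toy, its values, and centers = 2 are hypothetical, not my real data). It rests on three assumptions that may not hold for the real dataset: that the NAs left by full_join mean "species not observed" and can be set to 0, that the character species column has to be dropped before calling dist(), and that Euclidean distance is an acceptable measure here. Since dist() compares rows, the matrix is transposed so that samples end up in rows.

```r
# Toy stand-in for `samples`: one species ID column plus numeric sample columns.
toy <- data.frame(
  species = c("sp1", "sp2", "sp3", "sp4"),
  mvh1 = c(0.1, 9, 0.3, 5),
  mvh2 = c(8, 0.02, NA, 3),
  mvh3 = c(NA, NA, 1, 0.4),
  mvh4 = c(2, 1, 0.1, 2)
)

# Keep only the numeric sample columns. A character column turns the whole
# matrix into character, which leads to "NAs introduced by coercion".
abund <- as.matrix(toy[, -1])
rownames(abund) <- toy$species

# Assumption: an NA means the species was not observed in that sample.
abund[is.na(abund)] <- 0

# dist() computes distances between rows, so transpose to put samples in rows.
d <- dist(t(abund), method = "euclidean")
print(as.matrix(d))  # 4 x 4 sample-by-sample distance matrix

# k-means on the samples; centers = 2 is an arbitrary choice for the toy data.
set.seed(1)
km <- kmeans(t(abund), centers = 2)
print(km$cluster)
```

On the real data the same steps would give a 33 x 33 matrix (or 32 x 32 if one of the 33 columns is the species column), rather than a comparison between all 54,454 rows.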
Thanks