Any workaround for clustering mixed data types and rendering a 3D scatter plot in R?

Question

I am trying to see how data points are distributed within labeled groups in a 3D plot, and how similar the groups are to each other in 3D space. To do so, I used the scatterplot3d package from CRAN to draw a 3D scatter plot, but I didn't get the correct plot for my data.

reproducible data

Here is the reproducible data that I used.

    > dput(head(phenDat,30))
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", 
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01", 
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03", 
"Tarca_004_P1A04", "Tarca_064_P1F04", "Tarca_076_P1G04", "Tarca_088_P1H04", 
"Tarca_005_P1A05", "Tarca_017_P1B05", "Tarca_054_P1E06", "Tarca_066_P1F06", 
"Tarca_078_P1G06", "Tarca_090_P1H06", "Tarca_007_P1A07", "Tarca_019_P1B07", 
"Tarca_031_P1C07", "Tarca_079_P1G07", "Tarca_091_P1H07", "Tarca_008_P1A08", 
"Tarca_020_P1B08", "Tarca_022_P1B10", "Tarca_034_P1C10", "Tarca_046_P1D10"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6, 
30.6, 32.6, 12.6, 18.6, 25.6, 30.6, 36.4, 24.9, 28.9, 36.6, 19.9, 
26.1, 30.1, 36.7, 13.6, 17.6, 22.6, 24.7, 13.3, 19.7, 24.7), 
    Batch = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 
    6L, 6L, 6L), Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA"), Train = c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Platform = c("HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "GSE113966", "GSE113966", 
    "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", 
    "GSE113966")), row.names = c(NA, 30L), class = "data.frame")

my attempt:

hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) as.dist((1-cor(t(x)))/2)
d <- distfunc(persons_df)
fit <- hclustfunc(d)

my updated attempt:

library(rgl)
library(car)
scatter3d(x = PC1, y = PC2, z = PC3, surface = FALSE, groups = as.factor(clusters),  surface.col = cluster.colors, col = cluster.colors, xlab="PC1",ylab="PC2",zlab="PC3")

Basically, I want to see which data points (i.e., rows) belong to different batches (or groups) and color them by some 'group' attribute. I just want to see how similar the data points are to each other when grouped by age category, batch, and platform.

I am thinking of using k-means, PCA, or other methods that can give me components to visualize in a 3D plot, but it is not clear to me how to do this in R.

desired plot:

I want to get a 3D plot something like this:

Can anyone point out how I can make this happen? Is there any way to cluster my data and visualize it in a 3D plot in R? Any thoughts? Thanks

update: simplest things might be possible:

I don't want an overly complicated solution in the first place. I just want to group data points (i.e., rows) that belong to different batches, platforms, and age categories (I used findInterval(persons_df$ages, c(10,20,30,40,50))). Is there any way to make this happen in R?

Could you `dput(head(your_data,30))` and post the output? It's the best way to share your dataset, not everyone likes to download from unknown links. — s__, Jul 12 '19 at 06:33
@s_t thanks for the heads up. I just updated my post with pasting my data with `dput(head(my_data,30))`. I took the desired plot from one report, I just need a similar 3D plot. Could you have any idea or thought to make this happen in R? Thanks — Jerry07, Jul 12 '19 at 12:44
Perhaps this package could be useful for creating groups from your mixed data: https://cran.r-project.org/web/packages/PCAmixdata/vignettes/PCAmixdata.html — Jon Spring, Jul 14 '19 at 06:35
@JonSpring I am curious is that possible to use PCA, kmeans to do this? How can I make this happen by using PCA, kmeans? Could you give me a possible solution? thanks — Jerry07, Jul 14 '19 at 06:54
I've added to my solution to use a "k-modes" approach, which seems to work here for grouping with mixed categorical and quantitative data. — Jon Spring, Jul 14 '19 at 07:41
@JonSpring thanks a lot for your contribution. Before accepting your solution, I have a doubt, why data points distribution always flat in 3D plot? How can I get 3d scatter plot [desired scatter plot](https://stackoverflow.com/questions/24282143/pca-multiplot-in-r) ? Thank you — Jerry07, Jul 14 '19 at 15:56
@JonSpring plus I am having this error `Error in x[[jj]][iseq] <- vjj : replacement has length zero` when I used actual data. Could you give me a possible solution? I want to accept your solution. — Jerry07, Jul 14 '19 at 23:07
To be honest I haven't the foggiest idea why you're getting that error or how to fix. Is there a subset of the data you can share that creates that error? — Jon Spring, Jul 14 '19 at 23:45
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196447/discussion-between-jon-spring-and-dan). — Jon Spring, Jul 14 '19 at 23:45

Jon Spring · Accepted Answer · 2019-07-14T07:39:53.137

Edit - added k-modes approach for mixed data clustering.

You might also consider plotly for 3D plotting. Here's an example with your data, where I've defined a group for every existing combination of Batch, Platform, and 10-year age bucket. In plotly these are assigned different colors, and you can double-click the group legends to toggle their appearance. You'd need to modify this for much bigger data; for instance, you could remove Platform from the grouping since that's already mapped to z.

library(plotly)
library(dplyr)
library(RColorBrewer)

age_group <- 10  # width of each age bucket

phenDat %>% 
  # one label per combination of Batch, Platform and age bucket
  mutate(group = paste(Batch, Platform, "age", 
                       floor(GA/age_group)*age_group, "-", 
                       floor(GA/age_group)*age_group + age_group - 1)) %>%
  plot_ly(x = ~GA, y = ~Set, z = ~Platform, color = ~group) %>%
  # marker colors come from a 10-step ramp over the Spectral palette
  add_markers(marker = list(size = 2,
                            color = colorRampPalette(brewer.pal(11,"Spectral"))(10))) %>%
  layout(scene = list(xaxis = list(title = "GA"),
                      yaxis = list(title = "Set"),
                      zaxis = list(title = "Platform")))
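To see what the group labels built in the mutate() call look like, here is a minimal base-R sketch on six rows copied from the question's data. The vectors lower, upper and bucket are names introduced only for this illustration; the last step shows that the findInterval() call mentioned in the question produces the same bucketing as an index.

```r
# Six rows copied from the GA, Batch and Platform columns in the question
GA       <- c(11, 15.3, 21.7, 26.7, 31.3, 36.7)
Batch    <- c(1L, 1L, 1L, 1L, 1L, 4L)
Platform <- c("HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "GSE113966")

age_group <- 10
lower <- floor(GA / age_group) * age_group   # lower edge of each bucket
upper <- lower + age_group - 1               # upper edge used in the label

# paste() joins the pieces with spaces, as in the mutate() call
group <- paste(Batch, Platform, "age", lower, "-", upper)
print(unique(group))

# findInterval() with the question's break points gives the bucket index
bucket <- findInterval(GA, c(10, 20, 30, 40, 50))
print(table(bucket))
```

Dropping Platform or Batch from the paste() call gives coarser groups, which is the kind of modification larger data would need.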

As for clustering given the mixed data, here's an approach using the klaR package's kmodes function, which seems to create plausible results here:

phenDat %>%
  bind_cols(cluster = klaR::kmodes(phenDat, 6)[["cluster"]] %>% as.character) %>%
  plot_ly(x = ~GA, y = ~Set, z = ~Platform, color = ~cluster) %>%
  add_markers(marker = list(size = 5)) %>%
  layout(scene = list(xaxis = list(title = "GA"),
                      yaxis = list(title = "Set"),
                      zaxis = list(title = "Platform")))
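On the follow-up about the points looking flat: in the sample data Set takes a single value and Platform only two, so plotting against those axes lines the points up along two lines. One possible alternative, not part of the approach above, is to compute a Gower dissimilarity over the mixed columns, project it to three numeric coordinates with classical multidimensional scaling, and cluster with k-medoids. The sketch below assumes the cluster package is available (it ships with standard R installations), uses twelve rows copied from the question's data as a stand-in for phenDat, and picks k = 3 arbitrarily.

```r
library(cluster)  # daisy() and pam()

# Twelve rows copied from the question's dput output (stand-in for phenDat)
df <- data.frame(
  GA       = c(11, 15.3, 21.7, 31.3, 32.6, 12.6, 36.6, 26.1, 36.7, 13.6, 17.6, 24.7),
  Batch    = factor(c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6)),
  Platform = factor(c(rep("HTA20", 7), rep("GSE113966", 5)))
)

# Gower dissimilarity handles numeric and factor columns together
d <- daisy(df, metric = "gower")

# Classical MDS turns the dissimilarities into three numeric coordinates
coords <- cmdscale(d, k = 3)

# Partitioning around medoids works directly on the dissimilarity object
fit <- pam(d, k = 3)  # k = 3 is an arbitrary choice for illustration

# Coordinates (columns V1, V2, V3) plus the assigned cluster
result <- cbind(as.data.frame(coords), cluster = factor(fit$clustering))
print(head(result))
```

The columns V1, V2 and V3 of result could then be mapped to x, y and z in plot_ly(), with color = ~cluster, in place of GA, Set and Platform.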
