
I am having some trouble clustering countries using a sort of cultural correlation measure that I already have.

Basically, the dataset covers 90 countries, so it has 90 rows and 91 columns (90 country columns plus one column identifying the nation in each row), and looks like this:

 Nation Ita   Fra   Ger   Esp   Eng  ...
 Ita    NA    0.2   0.1   0.6   0.4  ...
 Fra    0.2   NA    0.2   0.1   0.3  ...
 Ger    0.7   0.1   NA    0.5   0.4  ...
 Esp    0.6   0.1   0.5   NA    0.2  ...
 Eng    0.4   0.3   0.4   0.2   NA   ...
 ...

I am looking for an algorithm that clusters my countries into groups (for instance groups of 3, or even better, more flexible clusters, such that neither the number of clusters nor the number of countries per cluster is fixed ex ante),

so that the output is, for instance:

  Nation   cluster
  Ita       1
  Fra       2
  Ger       3
  Esp       1
  Eng       3
  ...
– Carbo
3 Answers

# DATA: read the similarity matrix into a data frame
df1 = read.table(strip.white = TRUE, stringsAsFactors = FALSE, header = TRUE, text =
"Nation Ita   Fra   Ger   Esp   Eng
 Ita    NA    0.2   0.1   0.6   0.4
 Fra    0.2   NA    0.2   0.1   0.3
 Ger    0.7   0.1   NA    0.5   0.4
 Esp    0.6   0.1   0.5   NA    0.2
 Eng    0.4   0.3   0.4   0.2   NA")

# Replace the NA diagonal with 0 and move the Nation column into the row names
df1 = replace(df1, is.na(df1), 0)
row.names(df1) = df1[,1]
df1 = df1[,-1]

# Run PCA to visualize similarities
pca = prcomp(as.matrix(df1))    
pca_m = as.data.frame(pca$x)
plot(pca_m$PC1, pca_m$PC2)
text(x = pca_m$PC1, pca_m$PC2, labels = row.names(df1))

[PCA plot: countries plotted on PC1 vs PC2 with nation labels]

# Run k-means and choose centers based on pca plot
kk = kmeans(x = df1, centers = 3)
kk$cluster
# Ita Fra Ger Esp Eng 
#   3   1   2   1   1 
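
To get the Nation / cluster table asked for in the question, note that kk$cluster is a named vector (its names are the row names of df1), so something like this should work:

# Reshape the k-means assignment into the requested Nation / cluster format
data.frame(Nation = names(kk$cluster), cluster = unname(kk$cluster))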
– d.b
    It appears that you are treating the matrix as a distance matrix, but the OP said it is cultural _similarity_ – G5W Feb 08 '19 at 02:23
    FWIW, I tried a variation of this answer by using the inverse of the df1 values, and I got identical results. I used `df2 <- df1 %>% mutate_all(funs(1 - .))`, which converted 0.2 to 0.8, 0.7 to 0.3, etc., and then plugged that data frame into the `prcomp` and the rest. It's beyond my current understanding of PCA to understand if that supports or refutes the correctness of the answer. :-( – Jon Spring Feb 16 '19 at 16:54
    Thank you for your help! – Carbo Feb 24 '19 at 15:08

Hierarchical Agglomerative Clustering (HAC), one of the oldest clustering methods, can also be implemented with similarity instead of distance.

Conceptually, you repeatedly search for the pair with the maximum similarity (e.g., Ita and Ger) and merge them, until only the desired number of clusters remains.

In your case, though, it is probably easier to just use 1 - sim as the distance and reuse the existing implementations, for instance along the lines of the sketch below.
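
Here is a minimal base-R sketch of that route, assuming df1 is the similarity matrix prepared as in the first answer (NA diagonal set to 0, Nation column moved into the row names); the symmetrization, the linkage method and the cut values are only illustrative assumptions:

# Sketch: HAC on 1 - similarity, reusing df1 from the first answer
sim = as.matrix(df1)
sim = (sim + t(sim)) / 2            # symmetrize (the example matrix is not symmetric)
d   = as.dist(1 - sim)              # turn similarity into a distance
hc  = hclust(d, method = "average") # average-linkage agglomerative clustering
plot(hc)                            # inspect the dendrogram before choosing a cut
cutree(hc, k = 3)                   # cut into, e.g., three clusters ...
cutree(hc, h = 0.6)                 # ... or cut at a height, so k is not fixed ex ante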

– Has QUIT--Anony-Mousse

You might consider spectral clustering, which runs k-means on a few eigenvectors of the graph Laplacian underlying your similarity graph: https://en.wikipedia.org/wiki/Spectral_clustering. A rough sketch is below.
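
A rough base-R sketch of that idea, using the unnormalized Laplacian and assuming df1 is the prepared similarity matrix from the first answer; S, the symmetrization step and k = 3 are assumptions for illustration, not something given in the question:

# Sketch: spectral clustering of the similarity matrix via the unnormalized Laplacian
S = as.matrix(df1)                    # similarity matrix, NA diagonal already set to 0
S = (S + t(S)) / 2                    # symmetrize
k = 3                                 # number of clusters
D = diag(rowSums(S))                  # degree matrix
L = D - S                             # unnormalized graph Laplacian
ev = eigen(L, symmetric = TRUE)       # eigenvalues are returned in decreasing order
U  = ev$vectors[, (nrow(S) - k + 1):nrow(S)]  # eigenvectors of the k smallest eigenvalues
km = kmeans(U, centers = k, nstart = 25)
data.frame(Nation = rownames(S), cluster = km$cluster)

On the real 90 x 90 matrix a normalized Laplacian is often preferred, but the overall recipe (eigenvectors of the Laplacian, then k-means on them) stays the same.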

– Juan Carlos Ramirez