2

I calculated the CCC metric in R (package NbClust) and in SAS (https://support.sas.com/documentation/onlinedoc/v82/techreport_a108.pdf).

All the Pseudo-F and R-square are matching exactly with the SAS output except for the E_R2 and hence CCC. I have tried and tested in multiple ways but couldn’t get any explanation for this.

I have attached the "sample_data.csv", "initial_seed.csv" and "SAS_cluster.Rdata" files which I used here: https://drive.google.com/folderview?id=0ByZXEsgyZupEemFTdWRuTEVmZ1k&usp=sharing

Following are the values of expected R2 (E_R2) in R and SAS respectively:

E_R2 = 0.4630339 vs. ERSQ = 0.3732597284

Could you please help me out in finding what's going wrong in the background?

Kindly find below the codes I have used for this:

----------
<SAS-Code>
----------
proc fastclus data=sample_data maxiter=100 seed=initial_seed maxc=5 outstat=metrics out=output;
Var v1 v2 v3 v4 v5 v6;
run;

----------
<R-Code>
----------
load("SAS_cluster.Rdata")

clust.perf.metrics <- function(data, cl) {

  data1 <- as.matrix(data)
  numberObsBefore <- dim(data1)[1]
  data <- na.omit(data1)
  nn <- numberObsAfter <- dim(data)[1]
  pp <- dim(data)[2]
  qq <- max(cl)
  TT <- t(data) %*% data
  sizeEigenTT <- length(eigen(TT)$value)
  eigenValues <- eigen(TT/(nn - 1))$value

  for (i in 1:sizeEigenTT) {
    if (eigenValues[i] < 0) {
      cat(paste("There are only", numberObsAfter, "non-missing observations out of a possible",numberObsBefore, "observations."))

      stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
    }
  }

  s1 <- sqrt(eigenValues)
  ss <- rep(1, sizeEigenTT)

  for (i in 1:sizeEigenTT) {
    if (s1[i] != 0)
      ss[i] = s1[i]
  }

  vv <- prod(ss)
  z <- matrix(0, ncol = qq, nrow = nn)
  clX <- as.matrix(cl)

  for (i in 1:nn)
    for (j in 1:qq) {
      z[i, j] == 0
      if (clX[i, 1] == j) z[i, j] = 1
    }

  xbar <- solve(t(z) %*% z) %*% t(z) %*% data
  B <- t(xbar) %*% t(z) %*% z %*% xbar
  W <- TT - B
  R2 <- 1 - (sum(diag(W))/sum(diag(TT)))
  PseudoF <- (sum(diag(B))/(qq-1))/(sum(diag(W))/(nn-qq))

  v1 <- 1
  u1 <- rep(0, pp)
  c1 <- (vv/qq)^(1/pp)
  u1 <- ss/c1
  k1 <- sum((u1 >= 1) == TRUE)
  p1 <- min(k1, qq - 1)


  if (all(p1 > 0, p1 < pp)) {
    for (i in 1:p1) { v1 <- v1 * ss[i]}
    c <- (v1/qq)^(1/p1)
    u <- ss/c
    b1 <- sum(1/(nn + u[1:p1]))
    b2 <- sum(u[(p1 + 1):pp]^2/(nn + u[(p1 + 1):pp]), na.rm = TRUE)
    E_R2 <- 1 - ((b1 + b2)/sum(u^2)) * ((nn - qq)^2/nn) * (1 + (4/nn))
    ccc <- log((1 - E_R2)/(1 - R2)) * (sqrt(nn * p1/2)/((0.001 + E_R2)^1.2))

  } else {
    b1 <- sum(1/(nn + u))
    E_R2 <- 1 - (b1/sum(u^2)) * ((nn - qq)^2/nn) * (1 + 4/nn)
    ccc <- log((1 - E_R2)/(1 - R2)) * (sqrt(nn * pp/2)/((0.001 + E_R2)^1.2))
  }
  results <- list(R_2=R2, PseudoF=PseudoF, CCC = ccc, E_R2=E_R2);return(results)
}

clust.perf.metrics(output[,1:6],output[,7])

Thanks and Regards,

Chaks
  • 21
  • 4
  • 2
    Your link is regarding SAS 8.2, and a 31 year old paper. Is that the version of SAS you're using? – Reeza Feb 23 '15 at 19:42
  • 1
    A few thoughts: 1. As Reeza says it's an old paper and SAS may have improved their `E(R^2)` calculation since. 2. The `NbClust` method seems to diverge from that of the SAS paper in its selection of `p*` by using the original `u` rather than `u` recalculated with possible values of `p*`. – SRSwift Feb 23 '15 at 23:44
  • I couldn't find any newer version of the CCC manual from SAS. This was the only one I got. If you are aware of the recent changes in `E(R^2)` calculation by SAS, please let me know. Also, I tried selecting `p*` using `u` recalculated manually in excel with possible values of `p*` but didn't get a match. It will be really helpful if someone could kindly provide some pseudo code for this exercise...Thanks. – Chaks Feb 24 '15 at 07:55
  • 1
    I don't know of any new calculation, and can't see any from a web search. However, SAS doesn't always provide public documentation of their methods. I've had a quick go at reimplementing the process from the paper but have been unable to duplicate the SAS `E(R^2)`. The magnitude of the difference make me think SAS probably has a different approach. The `NbClust` approach uses the small sample calculation regardless of sample size, but changing this does not align the output with SAS. – SRSwift Feb 24 '15 at 23:48
  • Exactly!! That's the challenge I faced. So how to proceed now if I want to do it in R? Can I go with the code I shared in my post? Is it correct and eligible for defining cluster performance? Please suggest. Thanks a lot for all your help so far. I appreciate. – Chaks Feb 25 '15 at 10:09
  • 1
    I would be inclined to: implement the large sample `E(R^2)` calculation (page 7), verify that it acts as expected by plotting CCC against number of clusters for some known data (page 49), if you are satisfied that it's doing what you want then use it, if not use [an alternative method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). I would also recommend getting in contact with the `NbClust` maintainer Malika Charrad to discuss this issue, she may be able to help, or update the package. – SRSwift Feb 25 '15 at 19:19
  • 2
    Thanks a lot for your suggestion. Let me try along that line. Also as you have suggested, I have already [contacted Malika](https://www.mail-archive.com/r-help@r-project.org/msg223061.html) before posting here but didn't get any reply yet. – Chaks Feb 26 '15 at 05:43
  • Hey Chaks! Have you been able to solve the question? I too, have stumbled across that and want to know why SAS fastclus returns ccc values not in concert with their old tech document "big" formula or Sarle. – ttnphns Jul 19 '16 at 13:11

0 Answers0