I am doing k-means clustering and want to test whether the resulting clusters are statistically different. With 3-level clustering, I test cluster 0 against cluster 1 and then against cluster 2. Then I test cluster 2 with cluster 3. I tried to apply a t-test as shown in the following code. The clusters have different lengths, as you know. I am confused about the logic: should I use p > 0.05 or p < 0.05? And where should True and False go?

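  from scipy.stats import ttest_ind

  def compare_2_groups(ar1, ar2):
      # Two-sample t-test; s is the t statistic, p is the p-value
      s, p = ttest_ind(ar1, ar2)
      #if p>0.05:
      if p < 0.05:
          return False
      else:
          return True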
Saif

1 Answer

This procedure should work even if ar1 and ar2 have different lengths. The p-value indicates the strength of evidence AGAINST the null hypothesis that the two clusters have the same center (mean): the smaller the p-value, the stronger the evidence. Two suggestions:

  • rename the function to reflect the nature of the test, like "are_group_centers_equal"
  • with this name, return False if p < (your threshold) and True otherwise

If you choose a name with the opposite meaning "are_group_centers_different", reverse the logic of the threshold test, returning True if p < (threshold).
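As a rough sketch of the two naming options above (the 0.05 default threshold and the synthetic cluster data are assumptions for illustration only, not part of the original question):

```python
import numpy as np
from scipy.stats import ttest_ind

ALPHA = 0.05  # assumed significance threshold; pick your own


def are_group_centers_different(ar1, ar2, alpha=ALPHA):
    """True when the t-test rejects the null hypothesis of equal means."""
    _, p = ttest_ind(ar1, ar2)  # works for samples of different lengths
    return bool(p < alpha)


def are_group_centers_equal(ar1, ar2, alpha=ALPHA):
    """True when the test finds no significant difference (p >= alpha)."""
    return not are_group_centers_different(ar1, ar2, alpha)


# Illustrative data only: two synthetic "clusters" of unequal size
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=50)
cluster_b = rng.normal(loc=5.0, scale=1.0, size=80)

print(are_group_centers_different(cluster_a, cluster_b))
print(are_group_centers_equal(cluster_a, cluster_a))
```

Note that a True from "are_group_centers_equal" only means the test found no significant difference at the chosen threshold; it is not proof that the centers are identical. If the clusters may have unequal variances, ttest_ind also accepts equal_var=False (Welch's t-test).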

AbbeGijly
  • Dear AbbeGijly, if I have clusters that have both numeric and categorical values, can I still use the t-test? And is there Python t-test code for mixed values? – Saif Dec 28 '21 at 16:01