I have a large dataset of clusters with values for a parameter. Multiple clusters can have the same value.

I want to make a cumulative percent frequency distribution plot, with the cumulative percentage of the number of clusters on the y-axis and the parameter values (which range from 0 to 1) on the x-axis.

I have sorted the data by value, but I am not sure how to process it from there to get the cumulative plot using R (ecdf) or matplotlib. How can I approach this? Any help would be greatly appreciated.

My data looks like this:

Cluster_20637   0.020
Cluster_20919   0.020
Cluster_9642    0.147
Cluster_10141   0.148
Cluster_21451   0.148
Cluster_30198   0.148
Cluster_55982   0.498
Cluster_10883   0.500
Cluster_16641   0.500
Cluster_20143   0.500
Cluster_57942   0.867
Cluster_32878   0.868
Cluster_26249   0.870
Cluster_46928   0.870
Cluster_41908   0.871
Cluster_28603   0.872
Cluster_1419    0.873
psaima

1 Answer

Here's a dump of the data as a data.frame called test:

test <- structure(list(cluster = structure(c(6L, 7L, 17L, 1L, 8L, 11L, 
15L, 2L, 4L, 5L, 16L, 12L, 9L, 14L, 13L, 10L, 3L), .Label = c("Cluster_10141", 
"Cluster_10883", "Cluster_1419", "Cluster_16641", "Cluster_20143", 
"Cluster_20637", "Cluster_20919", "Cluster_21451", "Cluster_26249", 
"Cluster_28603", "Cluster_30198", "Cluster_32878", "Cluster_41908", 
"Cluster_46928", "Cluster_55982", "Cluster_57942", "Cluster_9642"
), class = "factor"), value = c(0.02, 0.02, 0.147, 0.148, 0.148, 
0.148, 0.498, 0.5, 0.5, 0.5, 0.867, 0.868, 0.87, 0.87, 0.871, 
0.872, 0.873)), .Names = c("cluster", "value"), row.names = c(NA, 
-17L), class = "data.frame")

Which looks like:

         cluster value
1  Cluster_20637 0.020
2  Cluster_20919 0.020
3   Cluster_9642 0.147
<<snip>>
16 Cluster_28603 0.872
17  Cluster_1419 0.873

Generate a cumulative percentage variable (stored as a proportion between 0 and 1; multiply by 100 for percent):

> test$cumperc <- (1:nrow(test))/nrow(test)
> test

         cluster value    cumperc
1  Cluster_20637 0.020 0.05882353
2  Cluster_20919 0.020 0.11764706
3   Cluster_9642 0.147 0.17647059
<<snip>>
14 Cluster_46928 0.870 0.82352941
15 Cluster_41908 0.871 0.88235294
16 Cluster_28603 0.872 0.94117647
17  Cluster_1419 0.873 1.00000000

Then plot the data:

plot(test$value,test$cumperc,type="l",xlim=c(0,1))

[Plot: cumperc against value, drawn as a line, x-axis from 0 to 1]
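Since the question mentions ecdf, here is a minimal sketch of the same idea using base R's ecdf(), which handles tied values automatically. The value vector is copied from the sample data so the block runs on its own; the names Fn, x and cum_pct are illustrative only.

```r
# Sample values from the question (17 clusters)
value <- c(0.02, 0.02, 0.147, 0.148, 0.148, 0.148, 0.498, 0.5, 0.5, 0.5,
           0.867, 0.868, 0.87, 0.87, 0.871, 0.872, 0.873)

# ecdf() returns a step function: Fn(v) = proportion of observations <= v
Fn <- ecdf(value)

# Evaluate it at each distinct value; tied clusters are counted together
x <- knots(Fn)             # sorted distinct values
cum_pct <- 100 * Fn(x)     # cumulative percentage of clusters

# type = "s" draws a step plot, which matches the shape of an empirical CDF
plot(x, cum_pct, type = "s", xlim = c(0, 1), ylim = c(0, 100),
     xlab = "Parameter value", ylab = "Cumulative % of clusters")
```

Calling plot(Fn) directly also works if a 0 to 1 proportion scale on the y-axis is acceptable.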

Edit to address comment below:

Try this to group the clusters first:

tabvals <- table(test$value)
# names(tabvals) are character strings, so convert them to numbers for the x-axis
plot(as.numeric(names(tabvals)),seq_along(tabvals)/length(tabvals),xlim=c(0,1),type="l")

Which gives this plot:

[Plot: cumulative proportion of distinct values against value, drawn as a line, x-axis from 0 to 1]
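If each distinct value should instead be weighted by how many clusters share it, which is how I read the comment below, one option is to take cumsum() of the table counts. This is a sketch under that assumption; the values are rebuilt inline so it runs on its own, and the names counts, cum_n and cum_pct are illustrative.

```r
# Sample values from the question (17 clusters)
value <- c(0.02, 0.02, 0.147, 0.148, 0.148, 0.148, 0.498, 0.5, 0.5, 0.5,
           0.867, 0.868, 0.87, 0.87, 0.871, 0.872, 0.873)

counts <- table(value)                # number of clusters at each distinct value
x <- as.numeric(names(counts))        # distinct values, converted back to numbers
cum_n <- unname(cumsum(counts))       # running total of clusters
cum_pct <- 100 * cum_n / sum(counts)  # cumulative percentage of clusters

# Small summary table: value, frequency, cumulative percent
freq_table <- data.frame(value = x, n = as.vector(counts), cum_pct = cum_pct)
print(freq_table)

plot(x, cum_pct, type = "s", xlim = c(0, 1), ylim = c(0, 100),
     xlab = "Parameter value", ylab = "Cumulative % of clusters")
```

Here the first row has n = 2 because two clusters share the value 0.020, and the last cumulative percentage is 100.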

thelatemail
  • Thanks, but I think this solution calculates the cumulative percent for each cluster, right? I want to count the clusters that share the same value (e.g. in the test data, 2 clusters have the value 0.020), then calculate the cumulative percent of those cluster frequencies and plot it against the parameter. How can I do this? – psaima Jun 02 '12 at 06:27