How to process data for a cumulative percent frequency plot in R

Question

I have a large dataset of clusters with values for a parameter. Multiple clusters can have the same value.

I want to make a cumulative percent frequency distribution plot, with the cumulative percentage of the number of clusters on the y axis and the parameter values (which range from 0 to 1) on the x axis.

I have sorted the data by value, but I am not sure how to process it from there to get the cumulative plot using R (ecdf) or matplotlib. How can I approach this? Any help would be greatly appreciated.

My data looks like this

Cluster_20637   0.020
Cluster_20919   0.020
Cluster_9642    0.147
Cluster_10141   0.148
Cluster_21451   0.148
Cluster_30198   0.148
Cluster_55982   0.498
Cluster_10883   0.500
Cluster_16641   0.500
Cluster_20143   0.500
Cluster_57942   0.867
Cluster_32878   0.868
Cluster_26249   0.870
Cluster_46928   0.870
Cluster_41908   0.871
Cluster_28603   0.872
Cluster_1419    0.873

Thanks joran for the editing--I could not figure out how to do the formatting! — psaima, Jun 02 '12 at 05:55
Possible duplicate of : http://stackoverflow.com/questions/10030547/frequency-and-cumulative-frequency-curve-on-the-same-graph-in-r/10031056#10031056 — Etienne Low-Décarie, Jun 04 '12 at 11:30

thelatemail · Answer 1 · 2012-06-02T06:44:38.293

Here's a dump of the data as a data.frame called test:

test <- structure(list(cluster = structure(c(6L, 7L, 17L, 1L, 8L, 11L, 
15L, 2L, 4L, 5L, 16L, 12L, 9L, 14L, 13L, 10L, 3L), .Label = c("Cluster_10141", 
"Cluster_10883", "Cluster_1419", "Cluster_16641", "Cluster_20143", 
"Cluster_20637", "Cluster_20919", "Cluster_21451", "Cluster_26249", 
"Cluster_28603", "Cluster_30198", "Cluster_32878", "Cluster_41908", 
"Cluster_46928", "Cluster_55982", "Cluster_57942", "Cluster_9642"
), class = "factor"), value = c(0.02, 0.02, 0.147, 0.148, 0.148, 
0.148, 0.498, 0.5, 0.5, 0.5, 0.867, 0.868, 0.87, 0.87, 0.871, 
0.872, 0.873)), .Names = c("cluster", "value"), row.names = c(NA, 
-17L), class = "data.frame")

Which looks like:

         cluster value
1  Cluster_20637 0.020
2  Cluster_20919 0.020
3   Cluster_9642 0.147
<<snip>>
16 Cluster_28603 0.872
17  Cluster_1419 0.873

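Generate a cumulative percentage variable (stored as a proportion from 0 to 1; this assumes the rows are already sorted by value, as they are here):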

> test$cumperc <- (1:nrow(test))/nrow(test)
> test

         cluster value    cumperc
1  Cluster_20637 0.020 0.05882353
2  Cluster_20919 0.020 0.11764706
3   Cluster_9642 0.147 0.17647059
<<snip>>
14 Cluster_46928 0.870 0.82352941
15 Cluster_41908 0.871 0.88235294
16 Cluster_28603 0.872 0.94117647
17  Cluster_1419 0.873 1.00000000

Then plot the data

plot(test$value,test$cumperc,type="l",xlim=c(0,1))

[Figure: line plot of cumperc against value, x axis from 0 to 1]
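
Since the question mentions ecdf, here is a minimal sketch of that route using `ecdf()` from base R's stats package. It is an alternative to the manual cumperc column, not a replacement for it. The variable names `value`, `Fn` and `pct_at_half` are just illustrative, and the values are retyped from the question so the snippet runs on its own.

```r
# Parameter values from the question (17 clusters)
value <- c(0.02, 0.02, 0.147, 0.148, 0.148, 0.148, 0.498, 0.5, 0.5, 0.5,
           0.867, 0.868, 0.87, 0.87, 0.871, 0.872, 0.873)

# ecdf() returns a step function: Fn(x) is the proportion of values <= x.
# Clusters that share a value are counted together at that value.
Fn <- ecdf(value)

# Multiply by 100 for a percentage, e.g. percent of clusters with value <= 0.5
pct_at_half <- 100 * Fn(0.5)

# Plot the step function over the full 0-1 range of the parameter
plot(Fn, xlim = c(0, 1), xlab = "value",
     ylab = "cumulative proportion of clusters", main = "ECDF of value")
```

`ecdf()` does not need the data to be sorted first, and the resulting plot is a step function rather than the interpolated line that `type="l"` draws.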

Edit to address comment below:

Try this to group the clusters first:

tabvals <- table(test$value)
plot(as.numeric(names(tabvals)),(1:length(tabvals))/length(tabvals),xlim=c(0,1),type="l")

Which gives this plot:

[Figure: line plot of the cumulative proportion of distinct values against value, x axis from 0 to 1]
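
Note that the table-based plot above rises by the same amount at each distinct value, whatever the number of clusters sharing it. If the goal is to weight each value by its cluster count, as the comment asks, one possible sketch is to take the running total of the table counts. The names `counts`, `cum_pct` and `x` are illustrative, and the data is retyped from the question so the snippet stands alone.

```r
# Parameter values from the question (17 clusters)
value <- c(0.02, 0.02, 0.147, 0.148, 0.148, 0.148, 0.498, 0.5, 0.5, 0.5,
           0.867, 0.868, 0.87, 0.87, 0.871, 0.872, 0.873)

# Number of clusters at each distinct value; table() orders the values
counts <- table(value)

# Running total of clusters, expressed as a percentage of all clusters
cum_pct <- 100 * cumsum(counts) / length(value)

# The table names are character strings, so convert them back to numbers
x <- as.numeric(names(counts))

# Step plot: the curve jumps at each distinct value by that value's share
plot(x, cum_pct, type = "s", xlim = c(0, 1),
     xlab = "value", ylab = "cumulative % of clusters")
```

With this data the first point is 2 of 17 clusters (about 11.8%) at 0.020, and the curve reaches 100% at 0.873.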

Thanks, but I think this solution calculates the cumulative percent for each cluster, right? I want to count the number of clusters that have the same value (e.g. in the test data, 2 clusters have the value 0.020), then calculate the cumulative percent of that cluster frequency and plot it against the parameter. How can I do this? — psaima, Jun 02 '12 at 06:27
