My data file is a single column of sorted values:

1
1
2
2
2
3
...
999
1000
1000

I can plot the CDF with a command like this (assuming 10000 lines in the file):

plot "file" using 1:(1/10000.) smooth cumulative title "CDF"

I can also put the x axis on a log scale with:

set logscale x

My problem is: how can I plot a CCDF with gnuplot?

In addition, the CDF on a log-log scale (set logscale xy) does not give me any output. What if I want a log-log CCDF plot?

Many thanks!

haos
  • I know what a CDF is, but what is a CCDF? And you can use `... using 1:(1) smooth cnormal` to plot a CDF. What is the error you get with `set logscale` (without any arguments)? – Christoph Jul 24 '15 at 07:11
  • CCDF means Complementary-CDF, where the y axis is reversed from 1-0 (down to up) and distribution is cumulated with "greater than" ("less than" in CDF plotting) – haos Jul 24 '15 at 23:13

1 Answer

I found a workaround for this problem, because I do not think you can plot a CCDF only using gnuplot.

Briefly, I just parsed my data using bash to create a dataset where the cumulative data is explicit; then gnuplot may simply plot the new dataset. As an example, assuming that your file contains the (numerical) values you want to cumulate, I would do in a bash environment:

cat data | sort -n | uniq --count | awk 'BEGIN{sum=0}{sum=sum+$1; print $2,$1,sum}' > parsed.dat

This command reads the dataset (cat data), sorts the numerical data using their value (sort -n), counts the occurrences of each sample (uniq --count) and creates a new dataset, calculating as well the cumulative sum of each data value (the awk command).

This new dataset contains three columns: the first ($1 in gnuplot) holds the unique values of your dataset, the second ($2) holds the number of occurrences of each value, and the third ($3) holds the cumulative sum of those counts.
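To make the three columns concrete, here is a small sketch on a made-up sample (the first six values from the question). The file names sample.dat and sample_parsed.dat are only placeholders, and I use the portable `uniq -c` spelling of `--count`. Note that this sketch adds the count before printing, so the running total in the third column includes the current value.

```shell
# Hypothetical sample file: the first six values shown in the question.
printf '%s\n' 1 1 2 2 2 3 > sample.dat

# Columns written: value, count, running total.
# The count is added BEFORE printing, so the total includes the current value.
sort -n sample.dat | uniq -c | awk '{sum += $1; print $2, $1, sum}' > sample_parsed.dat

cat sample_parsed.dat
# 1 2 2
# 2 3 5
# 3 1 6
```

The last value of the third column (6 here) equals the number of samples, which is what makes it usable as a normalization constant.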

Finally, in gnuplot, you can do this:

stats "parsed.dat" using 3;
plot "parsed.dat" using 1:($3/STATS_max) with lines title "CDF",\
"" using 1:(1-$3/STATS_max)  with lines title "CCDF",\
"" using 1:($2/STATS_max) with boxes title "PDF"

The stats command of gnuplot analyzes the third column (the one with the cumulative sum) and stores the results in variables. STATS_max is the maximum of this column, so it is the final cumulative sum. Now you have all the data you need to plot not only the CDF, but also the CCDF (which is 1 - CDF) and the PDF (or the normalized histogram, for discrete values).
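For the log-log CCDF asked about in the question, one thing to watch is that 1 - CDF reaches 0 at the largest value, and 0 cannot be drawn on a log axis. A possible workaround, sketched below, is to compute the CCDF as P(X >= x) instead of P(X > x), so every point is strictly positive. That convention is my assumption, not something from the question, and the file names (sample.dat, ccdf.dat, ccdf.gp) are placeholders.

```shell
# Hypothetical sample file: the first six values shown in the question.
printf '%s\n' 1 1 2 2 2 3 > sample.dat

# Write "value  P(X >= value)" for each unique value.
# Assumption: CCDF is taken as P(X >= x), so it never reaches 0.
sort -n sample.dat | uniq -c | awk '
    { value[NR] = $2; count[NR] = $1; total += $1 }   # store rows, sum counts
    END {
        remaining = total                  # samples with value >= current one
        for (i = 1; i <= NR; i++) {
            printf "%s %.6g\n", value[i], remaining / total
            remaining -= count[i]
        }
    }' > ccdf.dat

# Write a small gnuplot script that plots the result on log-log axes.
cat > ccdf.gp <<'EOF'
set logscale xy
plot "ccdf.dat" using 1:2 with linespoints title "CCDF"
EOF
```

Running `gnuplot -persist ccdf.gp` should then show the curve. If you prefer the strict 1 - CDF definition, the last point is 0 and will simply be missing from a log-scaled y axis.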

theleos