I am comparing different algorithms with respect to properties of the datasets, and I am measuring execution time. Because there can be multiple observations for one value of a property, I created a line graph in which each line shows the average execution time. However, I also wanted to see the extremes and quartiles, so my first idea was to add candlesticks at the relevant positions.

I expected it to look something like this:

[image: example of result]

My data is a CSV file with the relevant values:

size, GSP_min, GSP_firstQuartile, GSP_median, GSP_avg, GSP_thirdQuartile, GSP_max, SPAM_min, SPAM_firstQuartile, SPAM_median, SPAM_avg, SPAM_thirdQuartile, SPAM_max, PREFIX_SPAN_min, PREFIX_SPAN_firstQuartile, PREFIX_SPAN_median, PREFIX_SPAN_avg, PREFIX_SPAN_thirdQuartile, PREFIX_SPAN_max
498101.0, 101.0, 101.0, 385.6666666666667, 340.0, 716.0, 11.0, 11.0, 11.0, 33.666666666666664, 29.0, 61.0, 49.0, 49.0, 49.0, 60.333333333333336, 56.0, 76.0, 
730189.0, 189.0, 189.0, 3489.0, 3740.0, 6538.0, 19.0, 19.0, 19.0, 106.66666666666667, 114.0, 187.0, 32.0, 32.0, 32.0, 69.66666666666667, 81.0, 96.0, 

Here is my code and how I planned to achieve it:

set terminal png size 1024,1024
set bmargin 5
set key autotitle columnhead

set datafile separator ","

set style line 1 \
    linecolor rgb '#00ff00' \
    linetype 1 linewidth 2 \
    pointtype 7 pointsize 1.5

set style line 2 \
    linecolor rgb '#0000ff' \
    linetype 1 linewidth 2 \
    pointtype 7 pointsize 1.5

set style line 3 \
    linecolor rgb '#ff0000' \
    linetype 1 linewidth 2 \
    pointtype 7 pointsize 1.5

set boxwidth 0.1 relative
set style fill empty

set output 'sizeExp.png'
plot 'size.csv' using 1:4 with lp ls 1, \
         '' using 1:9 with lp ls 2, \
         '' using 1:14 with lp ls 3, \
         '' using ($1-1):3:2:6:5 with candlesticks whiskerbars, \
         '' using ($1):8:7:11:10 with candlesticks whiskerbars, \
         '' using ($1+1):13:12:16:15 with candlesticks whiskerbars

This is the generated result:

[image: generated plot]

There are two problems:

  1. Because the values differ a lot, I cannot set a sensible box width. I thought the "relative" keyword would handle this, but instead I got very strange box widths.
  2. I cannot place the bars next to each other; they overlap instead. I tried different offsets in the x expression, such as "($1+1)", but none gave a good result.

Is there a way to set these values relative to the image size?

A third problem, if someone could give me some advice: I expected the lines to be named "GSP_avg", "SPAM_avg", and "PREFIX_SPAN_avg", but instead I got that mess.

Jakub Peschel
  • Is it important to have the x-axis to scale? If yes, what should be the boxwidth with unequally distributed candlesticks? Would just 7 equally spaced candlesticks (or group of candlesticks) with the size as xtic be fine? About the legend: check `help key` option `noenhanced`. – theozh Jun 14 '22 at 15:15
  • Well, the placement of the candlesticks matters, because it corresponds to the value of the parameter, which in this case is the size of the dataset. To add: I don't mind if different groups partially overlap, but I would like the candlesticks for one x value to be non-overlapping and of a reasonably nice size. – Jakub Peschel Jun 14 '22 at 15:24
  • Hmm, distances around 5e-7 are rather small. It will be difficult to choose a reasonable boxwidth for all candlesticks. Can you please provide the full data or at least some realistic minimized data? I guess for the y-scale it will also be difficult to display the huge differences. – theozh Jun 14 '22 at 17:33

2 Answers

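  1. Your boxwidth: relative to what? With `relative`, the value is a fraction of the default width that gnuplot derives from the spacing of neighbouring boxes, not of the image size. Your x-coordinates (column 1) are on the order of 1e5 to 1e6. Hence you should set the boxwidth on the order of 50000 to 100000 absolute. Check help boxwidth.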

  2. Same for the offsets. An offset of ($1+50000) seems to be reasonable.

  3. Switch the key to noenhanced mode, so that the underscores in the column headers are not interpreted as subscript markers. Check help key.

I see another challenge: Your y-values span more than 3 orders of magnitude. It will be difficult to see them all at once. In the example below, I tried to set logscale y, but candlesticks in logscale look strange/unusual/confusing to me. Maybe there is another way to display or group your data.

Script:

### candlesticks grouped/with offset
reset session

$Data <<EOD
size, GSP_min, GSP_firstQuartile, GSP_median, GSP_avg, GSP_thirdQuartile, GSP_max, SPAM_min, SPAM_firstQuartile, SPAM_median, SPAM_avg, SPAM_thirdQuartile, SPAM_max, PREFIX_SPAN_min, PREFIX_SPAN_firstQuartile, PREFIX_SPAN_median, PREFIX_SPAN_avg, PREFIX_SPAN_thirdQuartile, PREFIX_SPAN_max
498101.0, 101.0, 101.0, 385.6666666666667, 340.0, 716.0, 11.0, 11.0, 11.0, 33.666666666666664, 29.0, 61.0, 49.0, 49.0, 49.0, 60.333333333333336, 56.0, 76.0, 
730189.0, 189.0, 189.0, 3489.0, 3740.0, 6538.0, 19.0, 19.0, 19.0, 106.66666666666667, 114.0, 187.0, 32.0, 32.0, 32.0, 69.66666666666667, 81.0, 96.0, 
EOD

set datafile separator ","
set style line 1 lc rgb '#00ff00' lw 2 pt 7 ps 1.5
set style line 2 lc rgb '#0000ff' lw 2 pt 7 ps 1.5
set style line 3 lc rgb '#ff0000' lw 2 pt 7 ps 1.5

set key autotitle columnhead noenhanced top left

set style fill empty
set boxwidth 1e4
set offsets graph 0.15, graph 0.15, graph 0.1, graph 0.1
set xtics 1e5
set logscale y

plot $Data u 1:4 w lp ls 1, \
         '' u 1:9 w lp ls 2, \
         '' u 1:14 w lp ls 3, \
         '' u ($1-5e4):3:2:6:5     w candlesticks whiskerbars, \
         '' u 1:8:7:11:10          w candlesticks whiskerbars, \
         '' u ($1+5e4):13:12:16:15 w candlesticks whiskerbars
### end of script

Result:

[image: resulting plot]
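
Instead of hard-coding the width and the offsets, they could also be derived from the x-range with `stats`. This is only a minimal sketch: it assumes the file is called 'size.csv' and uses the column numbers from the question's script. The variable names `bw` and `dx` and the factors 0.02 and 1.5 are arbitrary choices for illustration, not recommended values.

```gnuplot
### sketch: box width and offsets derived from the x-range
reset session
set datafile separator ","

# skip the header line, then read min/max of column 1 into STATS_min / STATS_max
stats 'size.csv' using 1 skip 1 nooutput

bw = 0.02 * (STATS_max - STATS_min)   # box width: 2% of the x span (arbitrary)
dx = 1.5 * bw                         # shift between neighbouring candlesticks (arbitrary)

set key autotitle columnhead noenhanced top left
set style fill empty
set boxwidth bw absolute

plot 'size.csv' u 1:4 w lp, \
     '' u 1:9 w lp, \
     '' u 1:14 w lp, \
     '' u ($1-dx):3:2:6:5     w candlesticks whiskerbars, \
     '' u 1:8:7:11:10         w candlesticks whiskerbars, \
     '' u ($1+dx):13:12:16:15 w candlesticks whiskerbars
### end of script
```

Because `dx` is larger than `bw`, the three candlesticks of one group cannot overlap each other. Groups can still overlap if two x values are closer than about `3*dx`, and the sketch breaks down if all x values are identical (the span is then zero).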

theozh
  • I discovered a problem in the data export that may be partially responsible for the shape of the candlesticks. I accidentally did not put a comma separator between size and GSP_min, which shifts the columns. – Jakub Peschel Jun 16 '22 at 12:37
  • Ideally, relative to the image size, so that the box has a fixed width. It would be perfect if I could set the boxwidth to something like 10px or some other value not connected to the x values. At the moment I have achieved something like that by using stats and computing the width from the min and max values of the x-axis. – Jakub Peschel Jun 16 '22 at 13:45
  • @JakubPeschel yes, that's one option which is relative to x-axis range. If you want "pixel" dimensions you can calculate something using the `GPVAL_` variables (check `help GPVAL`). But it will require some replot because they contain the values of graph size etc. only _after_ plotting. – theozh Jun 16 '22 at 13:59
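
The `GPVAL_` idea from the last comment could look roughly like the following. Treat it as a sketch under assumptions: it assumes a bitmap terminal such as `png` where one terminal unit corresponds to one pixel (other terminals may use a different scale), the file name 'size.csv' and the column numbers from the question, and the hypothetical variable names `px`, `x_per_px` and `bw`.

```gnuplot
### sketch: candlestick boxes roughly px pixels wide
reset session
set terminal png size 1024,1024
set datafile separator ","
set key autotitle columnhead noenhanced
set style fill empty
px = 10                                # desired box width in pixels (assumption)

# pass 1: plot once so that gnuplot fills in the GPVAL_* variables
set output 'sizeExp.png'
plot 'size.csv' u 1:4 w lp

# x-axis units per terminal unit along the plot area
x_per_px = (GPVAL_X_MAX - GPVAL_X_MIN) / real(GPVAL_TERM_XMAX - GPVAL_TERM_XMIN)
bw = px * x_per_px
set boxwidth bw absolute
set xrange [GPVAL_X_MIN:GPVAL_X_MAX]   # freeze the axis so the conversion stays valid

# pass 2: overwrite the file with the final plot
set output 'sizeExp.png'
plot 'size.csv' u 1:4 w lp, \
     '' u 1:9 w lp, \
     '' u 1:14 w lp, \
     '' u ($1-1.5*bw):3:2:6:5     w candlesticks whiskerbars, \
     '' u 1:8:7:11:10             w candlesticks whiskerbars, \
     '' u ($1+1.5*bw):13:12:16:15 w candlesticks whiskerbars
unset output
### end of script
```

The x-range is fixed after the first pass because the shifted candlesticks could otherwise change the autoscaled range and invalidate the pixel conversion. The price is that candlesticks at the outer x values may be clipped.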

I suggest that you look into the with boxplot style, which would calculate quartiles and construct appropriate candlestick-like plots directly from the data.

Here is an online demo for gnuplot boxplots.

See also the answer provided for this earlier question: How to plot grouped boxplot by gnuplot

Unlike the with candlesticks plot style, you can provide individual widths for the boxplots. There is also control over clustering and spacing between members of the cluster. From the documentation:

 By default only one boxplot is produced that represents all y values from the
 second column of the using specification. However, an additional (fourth)
 column can be added to the specification. If present, the values of that
 column will be interpreted as the discrete levels of a factor variable.
 As many boxplots will be drawn as there are levels in the factor variable.
 The separation between these boxplots is 1.0 by default, but it can be changed
 by `set style boxplot separation`. By default, the value of the factor variable
 is shown as a tic label below (or above) each boxplot.

Example

 # Suppose that column 2 of 'data' contains either "control" or "treatment"
 # The following example produces two boxplots, one for each level of the
 # factor
 plot 'data' using (1.0):5:(0):2

 The default width of the box can be set via `set boxwidth <width>` or may be
 specified as an optional 3rd column in the `using` clause of the plot command.
 The first and third columns (x coordinate and width) are normally provided as
 constants rather than as data columns.
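
To make the factor-column usage concrete, here is a small sketch. Note that `with boxplot` needs the raw measurements, not the pre-computed quartiles from the question's CSV. The data block below is made up for illustration (one run per line: algorithm name, execution time), and the name `$Raw` is hypothetical.

```gnuplot
### sketch: one boxplot per algorithm from raw measurements
reset session

# made-up raw data, one run per line: algorithm, time
$Raw <<EOD
GSP,101
GSP,340
GSP,716
SPAM,11
SPAM,29
SPAM,61
PrefixSpan,49
PrefixSpan,56
PrefixSpan,76
EOD

set datafile separator ","
set style fill empty
set style boxplot separation 1.0   # distance between boxplots of different factor levels
set boxwidth 0.5                   # default width, used because column 3 below is 0
unset key

# x position : y values : width (0 = use boxwidth) : factor column (algorithm name)
plot $Raw using (1.0):2:(0):1 with boxplot
### end of script
```

This draws the boxplots at evenly spaced positions labelled with the algorithm names. It does not by itself place groups at their true positions on a dataset-size axis; that would need one such plot element per size with a different first `using` entry.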
Ethan