Outliers between 1.5 and 3 times the interquartile range (IQR) are marked with a "+", and those beyond 3 times the IQR with an "o". Because this data set has so many outliers, the boxplot below is very hard to read: the "+" and "o" symbols are plotted on top of each other, creating what appears to be a thick red line.
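To make the two outlier classes concrete, here is a minimal sketch that classifies points by their distance from the box. The sample values are made up, and the quartile method and boundary handling are assumptions; `boxplot` from the statistics package may compute its quartiles slightly differently.

```octave
## Made-up sample data (assumption, not the tif values)
x = [1:9, 20, 100];

## Quartiles using Octave's default quantile method (assumption:
## boxplot may use a different quartile definition)
q = quantile (x(:), [0.25 0.75]);
iqr_x = q(2) - q(1);

## Distance of each point outside the box (zero or negative inside it)
dist = max (q(1) - x, x - q(2));

mild    = x(dist > 1.5 * iqr_x & dist <= 3 * iqr_x);  # would be drawn as "+"
extreme = x(dist > 3 * iqr_x);                        # would be drawn as "o"

printf ("mild: %s\nextreme: %s\n", mat2str (mild), mat2str (extreme));
```

With many values in both classes at nearly the same position, the two symbols overlap, which is what produces the thick line.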

I need to plot all the data, so removing values is not an option, but I would be fine with displaying "longer" boxes, i.e. stretching q1 and q4 to reach the true min/max values and skipping the "+" and "o" outlier symbols. I would also be fine if just the min and max outliers were displayed.

I'm totally in the dark here, and the Octave boxplot documentation did not include any helpful examples of how to handle outliers. A search here on Stack Overflow didn't get me closer to a solution either, so any help or direction is much appreciated!

How can I modify the below code to create a boxplot based on the same data set that is readable (i.e. doesn't plot outliers on top of each other creating a thick red line)?

[Image: the current boxplot, with overlapping outlier symbols]

I'm using Octave 4.2.1 (64-bit) on a Windows 10 machine with qt as the graphics_toolkit, and gdal_translate is called from within Octave to handle the tif files.

Switching graphics_toolkit to gnuplot etc. is not an option, since I haven't been able to "rotate" the plot there (horizontal boxes instead of vertical). Also, the result must show up in the exported .pdf file, not only in Octave's viewer.

Please forgive my totally "newbie-style" coding workaround to get a proper high-resolution PDF export:

pkg load statistics

clear all;
fns = glob ("*.tif");
for k=1:numel (fns)

  ofn = tmpnam;
  cmd = sprintf ('gdal_translate -of aaigrid "%s" "%s"', fns{k}, ofn);
  [s, out] = system (cmd);
  if (s != 0)
    error ('calling gdal_translate failed with "%s"', out);
  endif
  fid = fopen (ofn, "r");
  # read 6 headerlines
  hdr = [];
  for i=1:6
    s = strsplit (fgetl (fid), " ");
    hdr.(s{1}) = str2double (s{2});
  endfor
  d = dlmread (fid);

  # check size against header
  assert (size (d), [hdr.nrows hdr.ncols])

  # set nodata to NA
  d (d == hdr.NODATA_value) = NA;

  raw{k} = d;

  # create copy with existing values
  raw_v{k} = d(! isna (d));

  fclose (fid);

endfor

## generate plot
boxplot (raw_v)


set (gca, "xtick", 1:numel(fns),
          "xticklabel", strrep (fns, ".tif", ""));
ylabel ("Plats kvar (meter)");

set (gca, "ytick", 0:50:600);
set (gca, "ygrid", "on");
set (gca, "gridlinestyle", "--");

set (gcf, "paperunit", "centimeters", "papersize", [35, 60], "paperposition", [0 0 60 30], "paperorientation", "landscape")          


zoom (0.95)
view ([90 90])

print ("loudden_box_dotted.pdf", "-F:14")
johlund
  • aren't you using a modified boxplot as I remember from your last question? – Andy Feb 13 '18 at 07:43
  • all types of boxplot are uniform in their use of the box (first and third quartile as begin and end of the box, second quartile as band/line) so you can't modify this and still call it boxplot. – Andy Feb 13 '18 at 08:45
  • I was using the modified boxplot.m, but unfortunately it was too buggy and only worked about half of the time (too bad, since it was much better looking). I worked around the colors by converting the image to black and white, which looks a little bit better. But how would you handle the outliers in this data to avoid the "thick red line" problem? Even if I created a legend with "+" and "o" and an explanation (as I did for the report), you can hardly see that these are "+" and "o" symbols; it just appears as a thick red line. – johlund Feb 13 '18 at 10:55
  • Do you think it would be possible to use a much smaller font size for the outliers only and would that increase the readability? – johlund Feb 13 '18 at 11:34
  • What exactly do you want? You can for example change the color for the two types of outliers (red/green) or just remove all of them – Andy Feb 13 '18 at 14:47
  • I need to keep them, or ideally I need to keep the max and min outliers. The only goal is to create a boxplot that people can read, i.e. avoid the thick-red-line syndrome but still show the outliers somehow. How would you have done it if you needed to keep the outliers? There are so many of them that I'm not sure simply changing their color would help. – johlund Feb 13 '18 at 16:29
  • Changing the color of "+" and "o" would at least help to see the border between them. – Andy Feb 13 '18 at 16:32
  • How is that done and is it possible to change the font size for those symbols specifically? – johlund Feb 13 '18 at 17:09
  • I would mark an answer showing me a couple of ways to handle outlier visibility like changing color, size or removing them, as the accepted answer. – johlund Feb 14 '18 at 07:12
  • @Andy I misunderstood what "remove all of them" really meant. Removing them is exactly what I want to do. Should I do it by setting the "maxwhisker" to a high enough value so that the "+" and "o" symbols are never displayed or is there a better way? And where exactly do I write it in the code? boxplot (raw_v, maxwhisker="20")? – johlund Feb 19 '18 at 11:12

1 Answer

I would just delete the outliers. This is easy because boxplot returns the graphics handles. I've also included simple caching so you don't have to reload all the tifs while you are playing with the plots. Splitting the conversion, processing and plotting into different scripts is always a good idea (but not for Stack Overflow, where minimal examples are preferred). Here we go:

pkg load statistics

cache_fn = "input.raw";

# only process tif if not already done
if (! exist (cache_fn, "file"))
  fns = glob ("*.tif");
  for k=1:numel (fns)

    ofn = tmpnam;
    cmd = sprintf ('gdal_translate -of aaigrid "%s" "%s"', fns{k}, ofn);
    printf ("calling '%s'...\n", cmd);
    fflush (stdout);
    [s, out] = system (cmd);
    if (s != 0)
      error ('calling gdal_translate failed with "%s"', out);
    endif
    fid = fopen (ofn, "r");
    # read 6 headerlines
    hdr = [];
    for i=1:6
      s = strsplit (fgetl (fid), " ");
      hdr.(s{1}) = str2double (s{2});
    endfor
    d = dlmread (fid);

    # check size against header
    assert (size (d), [hdr.nrows hdr.ncols])

    # set nodata to NA
    d (d == hdr.NODATA_value) = NA;

    raw{k} = d;

    # create copy with existing values
    raw_v{k} = d(! isna (d));

    fclose (fid);

  endfor

  # save result
  save (cache_fn, "raw_v", "fns");
else
  load (cache_fn)
endif

## generate plot
[s, h] = boxplot (raw_v);

## in h you'll find now box, whisker, median, outliers and outliers2
## delete them
delete (h.outliers)
delete (h.outliers2)

set (gca, "xtick", 1:numel(fns),
          "xticklabel", strrep (fns, ".tif", ""));
ylabel ("Plats kvar (meter)");

set (gca, "ytick", 0:50:600);
set (gca, "ygrid", "on");
set (gca, "gridlinestyle", "--");

set (gcf, "paperunit", "centimeters", "papersize", [35, 60], "paperposition", [0 0 60 30], "paperorientation", "landscape")          

zoom (0.95)
view ([90 90])

print ("loudden_box_dotted.pdf", "-F:14")

gives

generated plot
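If only the most extreme value on each side should remain visible after deleting the outlier symbols, the per-group minimum and maximum can be computed directly from the cell array. This is only a sketch: the values below are made-up stand-ins for `raw_v`, and `lo` and `hi` are names introduced here for illustration.

```octave
## Made-up stand-in for raw_v: one column vector per tif (assumption)
raw_v = {[3; 5; 8; 250], [1; 4; 4; 6; 90]};

lo = cellfun (@min, raw_v);   # smallest value in each group
hi = cellfun (@max, raw_v);   # largest value in each group

## One line per group; printf cycles through the matrix column by column
printf ("group %d: min %g, max %g\n", [1:numel(raw_v); lo; hi]);
```

Assuming the boxes sit at x = 1..n (as the "xtick" setting in the script suggests), these values could then be overlaid with `hold on` followed by `plot (1:numel (raw_v), hi, "k+")` and the same call for `lo`, so each box shows a single marker at its true extremes instead of a dense column of symbols.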

Andy
  • Thank you so much, Andy! While working with the plot style I've actually been thinking "Andy would not have had to sit and wait for the script to read all the tifs over and over again, he would have known a better way". I somehow managed to go back to that modified version of boxplot.m (which didn't print outliers) and got it working, but now I might finally be learning how to create some boxplots by myself, thanks to you. – johlund Feb 24 '18 at 15:38
  • Keep in mind that there is also the GNU Octave help mailing list. But if you ask there as well, you should mention on both sides that you've also asked on Stack Overflow, to prevent duplicate work for others. And there is #octave on IRC. – Andy Feb 24 '18 at 17:42