Gadfly Histogram Appears to Select the Wrong Support

Question

I am new to Julia, and I am exploring the ways in which I can visualize distributions. Ultimately, I may fall back on the more robust matplotlib code base, but I really enjoy the dynamic visualization element that IJulia offers in the Notebook.

My issue concerns plotting histograms of proportional values with Gadfly. I am able to plot the kernel density, with Gadfly automatically selecting a reasonable support (i.e., one consistent with the underlying data: [-0.1, 0.5]).

    # Visualization
    using Gadfly

    # (Python) pandas analogue
    using DataFrames

    # Practice sets
    using RDatasets

    # Extract the Hedonic set
    hedonic = RDatasets.dataset("plm", "Hedonic")

    # Define density plot layer for black population proportion
    dens_layer = layer(hedonic, x=:Blacks, Geom.density, Theme(default_color=color("#de2d26")))

    # Plot kernel density
    dens_plot = plot(dens_layer, Guide.title("Distribution of Black Proportion"))

The histogram plot, however, is fit to a support that is far too large ([0,4]). All of the relevant data are captured by a single bar that spans the entire [0,1] interval.

    # Define histogram layer
    hist_layer = layer(hedonic, x=:Blacks, Geom.histogram, Theme(default_color=color("#de2d26")))

    # Plot histogram
    hist_plot_default = plot(hist_layer, Guide.title("Distribution of Black Proportion"))

When I increase the bincount, the support just grows. For example, with bincount=100, the support grows to [0,150], with all of the data still represented by a single bar.

    # Plot histogram again, this time with 100 bins
    hist_plot_bin100 = plot(hedonic, x=:Blacks, Geom.histogram(bincount=100), Theme(default_color=color("#de2d26")))

So, if anyone can tell me what I am screwing up, it would certainly be appreciated. Alternatively, perhaps restricting the range would force the appropriate allocation of histogram bars...? To that end, how do I restrict the range so that I can view the distribution on the [0,1] interval?

Abhijith · Answer 1 · 2016-08-27T19:21:42.090

This issue has been fixed; see the results below for your example:

julia> using Gadfly, DataFrames, RDatasets
julia> hedonic=RDatasets.dataset("plm","Hedonic")
julia> hist_layer=layer(hedonic,x=:Blacks,Geom.histogram,Theme(default_color=color("#de2d26")))
julia> hist_plot_default=plot(hist_layer, Guide.title("Distribution of Black Proportion"))

julia> hist_plot_bin100=plot(hedonic,x=:Blacks,Geom.histogram(bincount=100),Theme(default_color=color("#de2d26")))

score 0 · Answer 2 · answered Sep 21 '14 at 22:58

First of all, I can reproduce this. I think it goes down this branch in the code:

https://github.com/dcjones/Gadfly.jl/blob/040606f82c4e014611464068b0d5cda111b6662a/src/statistics.jl#L136-L143

    isdiscrete = false
    value_set = collect(Set(values[Bool[Gadfly.isconcrete(v) for v in values]]))
    sort!(value_set)


    if  length(value_set) / length(values) < 0.9
        d, bincounts, x_max = choose_bin_count_1d_discrete(
                    values, value_set, stat.minbincount, stat.maxbincount)

This is odd, because the data are not discrete and should not get discrete bins. If `choose_bin_count_1d` is used instead, it gives much more sensible answers. The way `bincount` changes the support is probably a related bug, but I am not sure how that happens. You should file an issue on the Gadfly GitHub page.
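To make the heuristic concrete, here is a minimal sketch in plain Julia (no Gadfly required) of the uniqueness-ratio test from the snippet above. The function name `looks_discrete` and the sample data are hypothetical, made up for illustration; only the `< 0.9` threshold comes from the linked source. If a proportion column contains many repeated values, it falls below the threshold and would be routed to the discrete binning path, even though the variable is continuous.

```julia
# Hypothetical helper mirroring the ratio test quoted above:
# data are treated as discrete when fewer than 90% of the values are unique.
looks_discrete(values) = length(unique(values)) / length(values) < 0.9

# Made-up proportion data with many repeats (not the actual Hedonic values).
repeated = [0.0, 0.0, 0.0, 0.1, 0.1, 0.25, 0.25, 0.25, 0.4, 0.5]

# Made-up data in which every value is distinct.
distinct = [0.01 * i for i in 1:10]

println(looks_discrete(repeated))  # 5 unique out of 10 values, ratio 0.5
println(looks_discrete(distinct))  # 10 unique out of 10 values, ratio 1.0
```

The first call reports the data as discrete and the second does not, which shows how repeated values alone, rather than the nature of the variable, decide which binning routine is chosen under this test.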

IainDunning

Thanks for taking a look; it's informative to know I am not the only one with this result. I have opened an issue [here](https://github.com/dcjones/Gadfly.jl/issues/435) (#435). – Marvin Ward Jr Sep 21 '14 at 23:13

Scanning `choose_bin_count_1d_discrete`, it seems the calculation of `mingap` uses the indices instead of the values at: `mingap = mingap == zero(eltype(xs)) ? b - a : min(b - a, mingap)`. Better to post this in the open issue. – Dan Getz Sep 22 '14 at 23:17
That stood out to me too. I assume it makes more sense for actually discrete data. – IainDunning Sep 22 '14 at 23:44
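
The difference raised in these comments can be sketched in plain Julia. Both functions below are hypothetical illustrations, not Gadfly's actual code: `min_gap_values` takes the smallest gap between consecutive sorted values, while `min_gap_indices` assumes (as the comment suggests) that `a` and `b` are loop indices, so the gap always comes out as 1. The sample data are made up.

```julia
# Hypothetical helper: smallest gap between consecutive sorted unique values.
function min_gap_values(xs_set)
    mingap = zero(eltype(xs_set))
    for i in 1:length(xs_set)-1
        a, b = xs_set[i], xs_set[i+1]          # a and b are data values
        mingap = mingap == zero(eltype(xs_set)) ? b - a : min(b - a, mingap)
    end
    return mingap
end

# Hypothetical helper for the suspected bug: a and b are positions, not values.
function min_gap_indices(xs_set)
    mingap = 0
    for (a, b) in zip(1:length(xs_set)-1, 2:length(xs_set))
        mingap = mingap == 0 ? b - a : min(b - a, mingap)  # b - a is always 1
    end
    return mingap
end

xs_set = [0.0, 0.1, 0.25, 0.4, 0.5]   # made-up sorted unique proportions

println(min_gap_values(xs_set))   # roughly 0.1, on the scale of the data
println(min_gap_indices(xs_set))  # 1, regardless of the data
```

A gap of 1 is wider than the whole range of proportion data, which would be consistent with the single wide bar described in the question, though that link is a guess rather than something confirmed here.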
