dataset: https://github.com/rashida048/Datasets/blob/master/StudentsPerformance.csv

from bokeh.plotting import figure, show
from bokeh.models import Range1d  # used to set x and y limits, e.g. p.y_range = Range1d(120, 230)

def box_plot(df, vals, label, ylabel=None,xlabel=None,title=None):

 
    # Group Data frame
    df_gb = df.groupby(label)
    # Get the categories
    cats = list(df_gb.groups.keys())

    # Compute quartiles for each group
    q1 = df_gb[vals].quantile(q=0.25)
    q2 = df_gb[vals].quantile(q=0.5)
    q3 = df_gb[vals].quantile(q=0.75)
                       
    # Compute interquartile region and upper and lower bounds for outliers
    iqr = q3 - q1
    upper_cutoff = q3 + 1.5*iqr
    lower_cutoff = q1 - 1.5*iqr

    # Find the outliers for each category
    def outliers(group):
        cat = group.name
        outlier_inds = (group[vals] > upper_cutoff[cat]) \
                                     | (group[vals] < lower_cutoff[cat])
        return group[vals][outlier_inds]

    # Apply outlier finder
    out = df_gb.apply(outliers).dropna()

    # Points of outliers for plotting
    outx = []
    outy = []
    for cat in cats:
        # only add outliers if they exist
        if cat in out and not out[cat].empty:
            for value in out[cat]:
                outx.append(cat)
                outy.append(value) 
                
    # If outliers, shrink whiskers to smallest and largest non-outlier
    qmin = df_gb[vals].min()
    qmax = df_gb[vals].max()
    upper = [min([x,y]) for (x,y) in zip(qmax, upper_cutoff)]
    lower = [max([x,y]) for (x,y) in zip(qmin, lower_cutoff)]

    cats = [str(i) for i in cats]
    # Build figure
    p = figure(sizing_mode='stretch_width', x_range=cats,height=300,toolbar_location=None)
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_width = 2
    p.yaxis.axis_label = ylabel
    p.xaxis.axis_label = xlabel
    p.title=title
    p.y_range.start=0
    p.title.align = 'center'
    
    # stems
    p.segment(cats, upper, cats, q3, line_width=2, line_color="black")
    p.segment(cats, lower, cats, q1, line_width=2, line_color="black")

    # boxes
    p.rect(cats, (q3 + q1)/2, 0.5, q3 - q1, fill_color=['#a50f15', '#de2d26', '#fb6a4a', '#fcae91', '#fee5d9'], 
           alpha=0.7, line_width=2, line_color="black")

    # median (almost-0 height rects simpler than segments)
    p.rect(cats, q2, 0.5, 0.01, line_color="black", line_width=2)

    # whiskers (almost-0 height rects simpler than segments)
    p.rect(cats, lower, 0.2, 0.01, line_color="black")
    p.rect(cats, upper, 0.2, 0.01, line_color="black")

    # outliers
    p.circle(outx, outy, size=6, color="black")

    return p

p = box_plot(df, 'Total', 'race/ethnicity', ylabel='Total spread',xlabel='',title='BoxPlot')
show(p)

[Figure: Boxplot]

Hi there. With the code and dataset above I can produce a boxplot as long as I pass a categorical variable to group by. However, I cannot produce anything when I try to make a boxplot for a single column, for example just to check the spread of the math scores. I tried

cats = df['math score'] 

but it didn't work. Any suggestions?

mosc9575
  • The code looks very familiar to me. Can you please describe your problem (including any error message) and maybe check the post [here](https://stackoverflow.com/questions/71978438/why-doesnt-bokeh-boxplot-appear/71992515#71992515)? This could be a duplicate. – mosc9575 Apr 29 '22 at 14:48
  • @mosc9575 Yes, it's me again, applying the code to different data sets. It works fine with categorical values, but I'm unsure what to adjust when I want a boxplot for an entire numeric column. – curiouscoder Apr 29 '22 at 16:35

1 Answer

I am not sure whether it is best to implement both cases in one function, but if that is your goal, one solution is to add a few if-else conditions.

Here is a description of the changes:

First give label a default.

# old
# def box_plot(df, vals, label, ylabel=None,xlabel=None,title=None):
# new
def box_plot(df, vals, label=None, ylabel=None,xlabel=None,title=None):

Then add an if-else part for the groupby section.

# old
# # Group Data frame
# df_gb = df.groupby(label)
# # Get the categories
# cats = list(df_gb.groups.keys())

# new
if label is not None:
    # Group Data frame
    df_gb = df.groupby(label)
    # Get the categories
    cats = list(df_gb.groups.keys())
else:
    df_gb = df[[vals]]
    cats = [vals]

Now the calculation of the outliers is a bit different, because we don't have to loop over several groups. Only one column is left, so the cutoffs are single numbers and a boolean mask is enough.

if label is not None:
    out = df_gb.apply(outliers).dropna()
else:
    out = df[(df[vals] > upper_cutoff) | (df[vals] < lower_cutoff)]

The upper and lower whisker bounds are now floats rather than lists.

if label is not None:
    upper = [min([x,y]) for (x,y) in zip(qmax, upper_cutoff)]
    lower = [max([x,y]) for (x,y) in zip(qmin, lower_cutoff)]
else:
    upper = min(qmax, upper_cutoff)
    lower = max(qmin, lower_cutoff)
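
To make the single-column branch concrete, here is a minimal sketch of the quartile, cutoff, outlier and whisker steps in plain pandas. The helper name `single_column_stats` and the sample scores are made up for illustration and are not part of the original code. Note that, as in the original function, the whiskers are clamped to the cutoffs rather than to the nearest non-outlier value.

```python
import pandas as pd


def single_column_stats(series):
    """Box-plot statistics for one numeric column (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q2 = series.quantile(0.5)
    q3 = series.quantile(0.75)

    # Interquartile range and the usual 1.5 * IQR outlier cutoffs
    iqr = q3 - q1
    upper_cutoff = q3 + 1.5 * iqr
    lower_cutoff = q1 - 1.5 * iqr

    # Without a groupby the cutoffs are scalars, so a boolean mask suffices
    outliers = series[(series > upper_cutoff) | (series < lower_cutoff)]

    # Whisker ends: the column's min/max, clamped to the cutoffs
    upper = min(series.max(), upper_cutoff)
    lower = max(series.min(), lower_cutoff)

    return {
        "q1": q1, "q2": q2, "q3": q3,
        "upper": upper, "lower": lower,
        "outliers": outliers.tolist(),
    }


# Made-up scores, only to exercise the helper
scores = pd.Series([40, 55, 60, 62, 65, 68, 70, 72, 75, 100], name="math score")
stats = single_column_stats(scores)
print(stats)
```

Every value in the returned dictionary is a plain number or a list, which is the shape the single-column plotting calls need.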

I also changed the line below so that the number of colors matches the number of categories, which avoids a warning.

colors = ['#a50f15', '#de2d26', '#fb6a4a', '#fcae91', '#fee5d9'][:len(cats)]
p.rect(cats, (q3 + q1)/2, 0.5, q3 - q1, fill_color=colors, alpha=0.7, line_width=2, line_color="black")

With these changes the output for

p = box_plot(df, 'math score', 'race/ethnicity', ylabel='Total spread',xlabel='',title='BoxPlot')

is still the same, but

p = box_plot(df, 'math score', ylabel='Total spread',xlabel='',title='BoxPlot')

now gives us a boxplot for the single column.

[Figure: box plot for "math score"]
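
For reference, here is one possible way to put the pieces together in a single function. This is a compact variant rather than a line-by-line copy of the snippets above: it treats the single column as one "group", so only the grouping step needs the if-else. It assumes a Bokeh version where `figure` accepts `height` (as the question's code does), uses `scatter` for the outliers, and runs on a small made-up DataFrame. The names `series_by_cat` and `PALETTE` are illustrative.

```python
import pandas as pd
from bokeh.plotting import figure

PALETTE = ['#a50f15', '#de2d26', '#fb6a4a', '#fcae91', '#fee5d9']


def box_plot(df, vals, label=None, ylabel=None, xlabel=None, title=None):
    # Only this step branches: without a label the column is a single group
    if label is not None:
        series_by_cat = {str(name): g for name, g in df.groupby(label)[vals]}
    else:
        series_by_cat = {vals: df[vals]}
    cats = list(series_by_cat)
    groups = list(series_by_cat.values())

    # Quartiles and 1.5 * IQR cutoffs, one entry per category
    q1 = [float(g.quantile(0.25)) for g in groups]
    q2 = [float(g.quantile(0.5)) for g in groups]
    q3 = [float(g.quantile(0.75)) for g in groups]
    upper_cut = [b + 1.5 * (b - a) for a, b in zip(q1, q3)]
    lower_cut = [a - 1.5 * (b - a) for a, b in zip(q1, q3)]

    # Whisker ends: min/max clamped to the cutoffs (as in the original code)
    upper = [min(float(g.max()), c) for g, c in zip(groups, upper_cut)]
    lower = [max(float(g.min()), c) for g, c in zip(groups, lower_cut)]

    # Outlier coordinates for plotting
    outx, outy = [], []
    for cat, g, lo, hi in zip(cats, groups, lower_cut, upper_cut):
        for value in g[(g > hi) | (g < lo)]:
            outx.append(cat)
            outy.append(float(value))

    p = figure(sizing_mode='stretch_width', x_range=cats, height=300,
               toolbar_location=None, title=title)
    p.xgrid.grid_line_color = None
    p.yaxis.axis_label = ylabel
    p.xaxis.axis_label = xlabel
    p.y_range.start = 0
    if title is not None:
        p.title.align = 'center'

    # stems, boxes, median, whiskers, outliers
    p.segment(cats, upper, cats, q3, line_width=2, line_color="black")
    p.segment(cats, lower, cats, q1, line_width=2, line_color="black")
    colors = [PALETTE[i % len(PALETTE)] for i in range(len(cats))]
    p.rect(cats, [(a + b) / 2 for a, b in zip(q1, q3)], 0.5,
           [b - a for a, b in zip(q1, q3)], fill_color=colors,
           alpha=0.7, line_width=2, line_color="black")
    p.rect(cats, q2, 0.5, 0.01, line_color="black", line_width=2)
    p.rect(cats, lower, 0.2, 0.01, line_color="black")
    p.rect(cats, upper, 0.2, 0.01, line_color="black")
    p.scatter(outx, outy, size=6, color="black")
    return p


# Made-up data, only to exercise both call styles
demo = pd.DataFrame({
    "race/ethnicity": ["group A"] * 5 + ["group B"] * 5,
    "math score": [40, 60, 62, 65, 100, 50, 55, 58, 60, 61],
})
p_single = box_plot(demo, "math score", title="BoxPlot")
p_grouped = box_plot(demo, "math score", "race/ethnicity", title="BoxPlot")
```

Passing either figure to `show()` from `bokeh.plotting` displays it. Cycling through the palette instead of slicing it keeps the color list the same length as `cats` even when there are more than five categories.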

mosc9575
  • You are truly magnificent, @mosc9575! Is there a course or other content you would recommend for learning the coding structures used in this code, so that I can upskill? I have been using platforms such as YouTube and Udemy to learn data analysis with Python, but the examples are very basic and don't cover advanced concepts, such as how to put all the different elements together. Any guidance would be great, as I am very eager and curious to learn. – curiouscoder May 03 '22 at 10:16
  • Is there a way to draw the boxplots horizontally? I have tried df.transpose() and also swapping x_range for y_range, but all I get is a blank graph. – curiouscoder May 03 '22 at 14:34
  • Most of this code is related to Bokeh and the Bokeh API. The best place to read about it is [here](https://docs.bokeh.org/en/latest/docs/gallery/boxplot.html) or, more generally, the [Bokeh website](https://docs.bokeh.org/en/latest/). To flip the axes you only have to change `x_range=cats` to `y_range=cats` and check every place where you use `cats`. This is something you can solve by trial and error. – mosc9575 May 03 '22 at 14:50
  • And of course you have to use [`hbar()`](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html?highlight=hbar#bokeh.plotting.Figure.hbar) instead of `vbar()`. – mosc9575 May 03 '22 at 14:57
  • 1) I have followed your guidance and changed x_range and hbar accordingly. The boxes became horizontal, but the whiskers and outliers disappeared completely. – curiouscoder May 03 '22 at 15:42
  • 2) I have tried adding a hover tool to show information at each point. Just before "return p" I added p.add_tools(HoverTool(tooltips=[('Math Score','@ms')])), and before defining the function, right at the beginning, I did source=ColumnDataSource(data=dict(ms=df.MathScore)). (I am paraphrasing the column names from memory.) The hover tool appears, but when I hover over an outlier only "???" is displayed. Any help? – curiouscoder May 03 '22 at 15:47
  • The `???` appears when Bokeh does not find a valid source for the field. It is best to ask about this in a new question. It is hard to explain without seeing your code and with only a few characters available in the comment section. – mosc9575 May 03 '22 at 19:28
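
For the horizontal layout and hover questions raised in these comments, a minimal single-column sketch follows. It is an assumption-laden illustration rather than a fix for the asker's actual code, which was never posted: the data is made up and the source column names `cat` and `ms` are illustrative. Going horizontal means the categories move to `y_range` and every glyph, not only the boxes, has its x and y arguments (and its width and height) exchanged. The hover tool can only resolve `@ms` if the hovered glyph is drawn from a `ColumnDataSource` that contains a column named `ms`, so the outliers are plotted from such a source here.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up scores for illustration
scores = pd.Series([40, 55, 60, 62, 65, 68, 70, 72, 75, 100], name="math score")
cats = [scores.name]

q1, q2, q3 = (float(scores.quantile(q)) for q in (0.25, 0.5, 0.75))
iqr = q3 - q1
upper_cutoff = q3 + 1.5 * iqr
lower_cutoff = q1 - 1.5 * iqr
upper = min(float(scores.max()), upper_cutoff)
lower = max(float(scores.min()), lower_cutoff)
outliers = scores[(scores > upper_cutoff) | (scores < lower_cutoff)]

# Categories now live on the y axis
p = figure(y_range=cats, height=200, title="BoxPlot")

# stems: numeric values are x coordinates, categories are y coordinates
p.segment([upper], cats, [q3], cats, line_width=2, line_color="black")
p.segment([lower], cats, [q1], cats, line_width=2, line_color="black")

# box: width spans the quartiles, height is a fraction of the category band
p.rect([(q3 + q1) / 2], cats, [q3 - q1], 0.5, fill_color="#fb6a4a",
       alpha=0.7, line_width=2, line_color="black")

# median and whisker caps: thin in x, extended in y
p.rect([q2], cats, 0.01, 0.5, line_color="black", line_width=2)
p.rect([lower], cats, 0.01, 0.2, line_color="black")
p.rect([upper], cats, 0.01, 0.2, line_color="black")

# outliers come from a source that really has the column named in the tooltip
source = ColumnDataSource(data=dict(
    cat=[cats[0]] * len(outliers),
    ms=outliers.tolist(),
))
dots = p.scatter("ms", "cat", source=source, size=6, color="black")
p.add_tools(HoverTool(renderers=[dots], tooltips=[("Math Score", "@ms")]))
```

Restricting the hover tool to `renderers=[dots]` keeps it from firing on the box and whisker glyphs, whose data sources have no `ms` column.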