I have data that I want to plot as a violin plot. For flexibility and compatibility reasons, I am using hvPlot with the Bokeh extension. However, the resulting plot looks very strange to me.

Here is the code:

import hvplot.pandas

import pandas as pd
import numpy as np

hvplot.extension('bokeh')

temp = [3.021e+00, 1.726e+00, 2.092e+00, 1.721e+00,    np.nan, 3.182e+00, 8.335e+00,
        9.851e+00, 9.743e+00, 3.153e+01,    np.nan, 2.458e+01, 1.138e+01, 9.886e+01,
        1.638e+01, 1.084e+00, 1.573e+00, 2.486e-01, 4.257e-01, 2.447e-01, 3.540e-01,
        3.644e-01, 3.296e-01, 3.450e-01, 2.739e-01, 3.404e-01, 4.427e-01, 8.187e-01,
        4.360e-01, 3.556e-01, 3.107e+00, 9.722e-01, 1.572e+00, 7.006e-01, 2.344e-01,
        9.597e-01, 1.372e-01, 2.530e-01, 1.100e+00, 5.791e+00, 1.093e+01, 4.249e+00,
        4.841e+00, 5.234e+00, 1.355e+01, 1.423e+01, 2.478e+01,    np.nan, 4.721e+01,
           np.nan, 9.234e+01,    np.nan, 1.045e+02,    np.nan, 8.989e+01,    np.nan,
        9.330e+01,    np.nan, 1.151e+02,    np.nan, 2.193e+02,    np.nan, 1.705e+02,
           np.nan, 3.835e+02,    np.nan, 1.689e+03,    np.nan, 1.879e+03,    np.nan,
        1.898e+03,    np.nan, 2.374e+03,    np.nan, 3.079e+03, 3.611e+03, 3.926e+03]

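# Wrap the raw values (including the NaNs) in a Series
test = pd.Series(temp)

# Draw the violin plot on a linear y-axis and display it
plt = test.hvplot.violin()
hvplot.show(plt)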

As you can see, my data spans a wide range of values and contains some NaN. I am sure that the violin plot filters these out, so the problem seems to lie elsewhere.

When plotted as is, the distribution seems consistent with the data: [image: good violin plot]

The issue occurs when I want to visualize the data on a log scale by writing test.hvplot.violin(logy=True) instead of test.hvplot.violin().

With this option, the plot looks like this: [image: weird violin plot]

As I understand it, a violin plot should show the quartiles, as a box plot does, and in addition show a smoothed distribution of the data. I would therefore expect the distribution to enclose the box plot, which is not the case here, since the median (Q2) lies outside the distribution.

I believe this is related to the way the distribution is computed. As with simple histograms, the bin size is not necessarily adapted to the log scale, so most of the data ends up in one very wide bin, which here corresponds to a single point of the distribution.
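To illustrate what I suspect, here is a minimal sketch. It assumes the density is evaluated on an evenly spaced linear grid between the minimum and maximum of the data. I have not verified that hvPlot/HoloViews actually does this, and the grid size of 100 and the shortened `values` array are arbitrary choices for the example.

```python
import numpy as np

# A few values covering the same orders of magnitude as my data (illustrative subset)
values = np.array([0.137, 0.35, 1.0, 3.0, 10.0, 100.0, 1000.0, 3926.0])

# Assumed evaluation grid: 100 points evenly spaced on a linear axis
linear_grid = np.linspace(values.min(), values.max(), 100)

# For comparison: 100 points evenly spaced in log10 space
log_grid = np.logspace(np.log10(values.min()), np.log10(values.max()), 100)

# Count the grid points that fall below 10, where most of the data sits
n_linear_below_10 = int((linear_grid < 10).sum())
n_log_below_10 = int((log_grid < 10).sum())

print("linear grid points below 10:", n_linear_below_10)
print("log grid points below 10:", n_log_below_10)
```

If the grid really is linear, almost none of its points land in the range that takes up most of a log axis, which would explain why the violin collapses there.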

A possible solution would be to apply np.log10 to my data and then plot the violin, which gives: [image: good log violin plot]
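For reference, the transform I mean is roughly the following. This is only a sketch on a shortened, illustrative `sample` series (not my full data), and the plotting call is left as a comment because it is the same call as above.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative subset of the data, including a NaN
sample = pd.Series([3.021, 1.726, np.nan, 98.86, 0.2486, 3926.0])

# Take log10 of every value; NaN entries stay NaN
log_sample = np.log10(sample)

# log_sample.hvplot.violin() then draws the violin on the transformed values,
# but the y-axis shows exponents (e.g. 2) rather than the original values (e.g. 100)
print(log_sample)
```

The drawback is visible in the last comment: the axis is labelled in log10 units, not in the original units that a real log axis would keep.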

However, my project makes this solution impossible to implement. Is there a workaround for this problem, such as an option that adapts the bins, or one that automatically applies np.log10 to the data while keeping the log scale on the plot?

Thanks in advance!

Murnawful
