Data visualisation using ridge and scatter plot

Question

Background: I am working in Python and have a lot of data points in a CSV file. So far, my code does the following:

Reads the CSV and its "result" column.
If the value in the "result" column is positive, it plots the corresponding A B C D E F G parameters, with the parameter value on the y-axis and the parameter name on the x-axis.
If more than 10 rows have a positive "result", it plots only the first 10 sets of A B C D E F G parameters.

An example of the type of dataset is below. (Mine contains around 12000 rows)

The Dataset


  A     B       C     D       E     F    G    result
1.00   0.85  -0.999  0.27   0.98  0.39  0.80  -0.86
0.89   0.4   -0.6    0.47   0.28  0.29  0.26   0.65
0.65  -1.00   0.26   0.67  -0.88  0.29  0.10   0.50
0.98  -0.98   0.76   0.37   0.68  0.59  0.90      0
   0   0.5    0.56   0.27   0.38  0.79  0.48  -0.65

The code :

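import pandas as pd
df = pd.read_csv("result.csv")
# Keep rows with a positive result, drop the result column, plot one marker per value
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')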
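
The one-liner above filters on a positive result but does not apply the first-10 limit from the description. A minimal sketch of all three steps follows, with the sample rows from this question inlined as text so that it runs without result.csv. The names `params`, `positive` and `first_ten` are only illustrative.

```python
import io

import pandas as pd

# Sample rows copied from the question; real data would come from pd.read_csv("result.csv").
csv_text = """A,B,C,D,E,F,G,result
1.00,0.85,-0.999,0.27,0.98,0.39,0.80,-0.86
0.89,0.4,-0.6,0.47,0.28,0.29,0.26,0.65
0.65,-1.00,0.26,0.67,-0.88,0.29,0.10,0.50
0.98,-0.98,0.76,0.37,0.68,0.59,0.90,0
0,0.5,0.56,0.27,0.38,0.79,0.48,-0.65
"""
df = pd.read_csv(io.StringIO(csv_text))

params = df.columns[:-1]                      # A..G, everything except "result"
positive = df.loc[df["result"] > 0, params]   # rows whose result is strictly positive
first_ten = positive.head(10)                 # at most the first 10 matching rows

# Transpose so parameter names go on the x-axis and values on the y-axis.
ax = first_ten.T.plot(ls="", marker="o")
```

With the five sample rows only two have a positive result, so `head(10)` changes nothing here; it only matters on the full dataset.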

Issue: When several values are identical, their markers land on the same spot, so it is hard to see the frequency distribution (for example in columns B and C below: they look similar, but one value has more points).

What I want to do is plot something like a ridge plot on top of the current graph (as I drew below) so that the frequency distribution can be seen. I am a novice in this type of data visualization. Kindly guide me on how this could be done.

Could a [violin plot](https://www.python-graph-gallery.com/violin-plot/) be what you're looking for? — applesoup, Jun 18 '21 at 11:06

Cimbali · Accepted Answer · 2021-06-18T11:47:37.620

The density plot type already does pretty much what you want; we just need to superimpose it on your data:

>>> data_to_plot = df.loc[df.result>0, df.columns[:-1]]
>>> data_to_plot.plot(kind='density')

This is trivial if you want horizontal subplots: simply pass subplots=True to either plot, then zip the returned axes with the columns to superimpose the other plot:

>>> import numpy as np
>>> axes = data_to_plot.plot(kind='density', subplots=True, legend=False)
>>> for ax, (colname, series) in zip(axes, data_to_plot.items()):  # iteritems() was removed in pandas 2.0
...     ax.plot(series.values, np.zeros_like(series), ls='', marker='o')
...     ax.set_ylabel(colname)

However, if you want them vertically, we'll likely have to compute the Gaussian densities ourselves. The pandas documentation points to scipy.stats.gaussian_kde. For this we'll need to know at which points to evaluate the kernel density estimate. On your example it looks like [-1, 1] is a good interval, but of course you can take it from the data min/max.

>>> from scipy.stats import gaussian_kde
>>> y = np.arange(-1, 1.01, .01)
>>> ridges = data_to_plot.apply(lambda s: gaussian_kde(s)(y))
>>> ridges
            A         B         C             D         E             F         G
0    0.001119  0.271510  0.270048  2.029737e-24  0.163222  2.352981e-15  0.000018
1    0.001247  0.272310  0.272122  4.796826e-24  0.164507  3.959987e-15  0.000021
2    0.001389  0.273071  0.274155  1.125941e-23  0.165765  6.637610e-15  0.000025
3    0.001545  0.273794  0.276145  2.624972e-23  0.166995  1.108083e-14  0.000030
4    0.001717  0.274479  0.278093  6.078288e-23  0.168200  1.842365e-14  0.000036
..        ...       ...       ...           ...       ...           ...       ...
196  0.939109  0.307535  0.314227  3.791151e-02  0.436305  3.153771e-01  0.630121
197  0.932996  0.304793  0.310216  3.100156e-02  0.431472  2.913782e-01  0.615406
198  0.926089  0.302012  0.306172  2.518140e-02  0.426576  2.682819e-01  0.600298
199  0.918401  0.299193  0.302097  2.031681e-02  0.421619  2.461581e-01  0.584834
200  0.909948  0.296337  0.297994  1.628194e-02  0.416607  2.250649e-01  0.569049

[201 rows x 7 columns]
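
As mentioned above, the evaluation interval can be taken from the data instead of being hard-coded. A minimal sketch, assuming synthetic stand-in values for `data_to_plot` (random numbers, not the question's data) and a dict comprehension in place of `apply`; the names `lo` and `hi` are only illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Synthetic stand-in for data_to_plot (hypothetical values, seeded for repeatability).
rng = np.random.default_rng(0)
data_to_plot = pd.DataFrame(rng.uniform(-1, 1, size=(50, 7)), columns=list("ABCDEFG"))

# Evaluation grid spanning the observed range rather than a fixed [-1, 1].
lo, hi = data_to_plot.min().min(), data_to_plot.max().max()
y = np.linspace(lo, hi, 201)

# One kernel density estimate per column, all evaluated on the shared grid y.
ridges = pd.DataFrame(
    {col: gaussian_kde(data_to_plot[col].to_numpy())(y) for col in data_to_plot.columns}
)
```

The result has one row per grid point and one column per parameter, the same layout as the `ridges` frame shown above.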

Then simply plot each ridge in a loop, as before. Some adjustment might be needed, but this is what it looks like with your sample data. Note that the ridges are scaled so that they all share the same scale and fit inside a 0.5-wide space on the plot.

>>> ax = data_to_plot.T.plot(ls='', marker='o')
>>> for n, (colname, ridge) in enumerate(ridges.items()):  # iteritems() was removed in pandas 2.0
...     ax.plot(ridge / (-2 * ridges.max().max()) + n, y, color='black')

Thank you ! this was a wonderful explanation ! However my concern is if it takes the (if result = positive then plot A B C D E F G ) condition into account? — user157522, Jun 18 '21 at 12:09
Yes @user157522 I start by defining `data_to_plot = df.loc[df.result>0, df.columns[:-1]]` so everything that uses `data_to_plot` only uses lines with `result > 0` — Cimbali, Jun 18 '21 at 12:15
I get the following error , I am using google colab for my work . Kindly guide me on how to handle it 'numpy.ndarray' object has no attribute 'iteritems' I change "iteritems" to "items" but it did not help. — user157522, Jun 19 '21 at 07:41
`iteritems` should be called on a pandas dataframe, not on a numpy array — Cimbali, Jun 19 '21 at 08:06
