
I'm trying to figure out how to do this in Python, as I'm newer to it than to R.

import plotnine as p9
import pandas as pd
import numpy as np

###load the data here...
dataset=pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv')

Here is an example of what isn't working. I'm not sure what I'm getting wrong:

p9.ggplot(dataset, p9.aes(x='sepal_width'))+p9.geom_density()+p9.geom_vline( p9.aes(xintercept='sepal_length.mean()', color='species'))

Why is the color not working? I want one vertical line per group, each in the appropriate color.

It would also be great if I could overlay the histogram.

Alexandre B.
runningbirds

1 Answer


You have to do the data manipulation separately. Plotnine/ggplot computes the correct per-group mean only when the computation is done in a stat. In your case the computation is done in the mapping: xintercept='sepal_length.mean()' maps xintercept to the mean of the whole sepal_length column. It does not take color='species' into account, so xintercept is the global mean.
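To see why every line lands in the same place, here is a small pandas-only illustration. It is not plotnine's internal code, just a sketch of what the mapping amounts to; the `toy` frame and its values are made up to stand in for the iris data.

```python
import pandas as pd

# Made-up stand-in for the iris data (two rows per species).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
    "sepal_length": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
})

# The expression in the mapping sees the whole column, so it yields one number.
global_mean = toy["sepal_length"].mean()

# That single number is then used for every row, whatever its species.
broadcast = toy.assign(xintercept=global_mean)
distinct_per_species = broadcast.groupby("species")["xintercept"].nunique()

# What the question actually wants: one mean per species.
group_means = toy.groupby("species")["sepal_length"].mean()

print(global_mean)
print(distinct_per_species)
print(group_means)
```

Each species ends up with the same single xintercept, so the colored lines would be drawn on top of each other at the global mean.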

from plotnine import *
from plydata import *

df = (
    dataset
    >> group_by('species')
    >> summarise(sl_mean='mean(sepal_length)')
)

(ggplot(dataset, aes(x='sepal_width'))
 + geom_density()
 + geom_vline(df, aes(xintercept='sl_mean', color='species'))
)

Resulting Plot
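If you would rather not depend on plydata, the same summary frame can be built with plain pandas. This is a sketch assuming pandas 0.25 or later (for named aggregation); the `toy` frame is a made-up stand-in for `dataset`, and the result has the same `species` and `sl_mean` columns that geom_vline uses above.

```python
import pandas as pd

# Made-up stand-in for `dataset`; swap in the real iris frame.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
    "sepal_length": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
})

# One row per species, with the group mean in a column named sl_mean.
# as_index=False keeps `species` as a regular column so it can be mapped to color.
df = (
    toy
    .groupby("species", as_index=False)
    .agg(sl_mean=("sepal_length", "mean"))
)

print(df)
```

The resulting `df` can be passed to geom_vline in place of the plydata version.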

To add a histogram:

(ggplot(dataset, aes(x='sepal_width'))
 + geom_histogram(aes(y='stat(density)'), alpha=.2)
 + geom_density()
 + geom_vline(df, aes(xintercept='sl_mean', color='species'))
)
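The y='stat(density)' mapping is what puts the histogram on the same scale as geom_density: bar heights become densities, so the total bar area is 1, instead of raw counts. Recent plotnine versions also provide after_stat('density') for the same mapping. Below is a rough NumPy sketch of that normalization; `np.histogram` stands in for the stat here, and the `values` array is made up.

```python
import numpy as np

# Made-up stand-in for the sepal_width column.
values = np.array([2.0, 2.5, 3.0, 3.0, 3.5, 4.0])

# Raw counts per bin: what a histogram shows by default.
counts, edges = np.histogram(values, bins=4)

# Density per bin: counts rescaled so that the bars' total area is 1.
density, _ = np.histogram(values, bins=4, density=True)

widths = np.diff(edges)
area = float((density * widths).sum())

# The same normalization written out by hand.
manual = counts / (counts.sum() * widths)

print(counts, density, area)
```

Without that rescaling the histogram bars would be counts, which are on a different scale from the density curve.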


has2k1
  • I used [plydata](https://github.com/has2k1/plydata) for the manipulation because it was faster for me, but you can do it with straight pandas. – has2k1 Aug 15 '19 at 23:09
  • Do you find plydata better than dfplyr? I know there are a few variations of dplyr implementations. – runningbirds Aug 15 '19 at 23:27
  • @runningbirds, I am a little biased since I created both plotnine and plydata. The main issue I had with other dplyr implementations is the use of a manager variable X, which I find to be clunky. Otherwise, use whatever suits you. – has2k1 Aug 16 '19 at 10:22
  • OK, I appreciate all the work @has2k1; plotnine is making the move to Python easier. – runningbirds Aug 16 '19 at 15:56