
I subset and melt an Airbnb dataset and try to plot a grouped chart:

import pandas as pd
from plotnine import *

airbnb_melted = pd.melt(airbnb_newcomers, id_vars=['host_id'], value_vars=['host_identity_verified', 'host_is_superhost'])
print(airbnb_melted)

The melted dataset looks like:

enter image description here

I know the following code is wrong and its output is not what I want, but it is the closest I have come to my idea:

ggplot(airbnb_melted, aes(x='variable', y='value')) +\
        geom_bar(stat = 'sum', position=position_dodge())

I have searched online and found many examples where y is a numerical variable and stat='count' can be used. Here, however, y is categorical, and the plot fails with PlotnineError: 'stat_count() must not be used with a y aesthetic'.
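As a rough illustration of why a count stat takes no y aesthetic: it tallies rows for each x position and category, and that tally becomes the bar height. The sketch below reproduces such a tally with pandas on a small made-up frame (the `toy` data and the `bar_heights` name are assumptions, not the real Airbnb data).

```python
import pandas as pd

# Hypothetical stand-in for the melted frame (not the real Airbnb data).
toy = pd.DataFrame({
    "host_id": [1, 2, 3, 4, 1, 2, 3, 4],
    "variable": ["host_identity_verified"] * 4 + ["host_is_superhost"] * 4,
    "value": ["t", "t", "f", "t", "f", "f", "t", "f"],
})

# One row per x position, one column per category:
# these are the heights a count-based bar chart would draw.
bar_heights = pd.crosstab(toy["variable"], toy["value"])
print(bar_heights)
```

Because the heights come from counting rows, mapping a column to y as well leaves the stat with two competing definitions of bar height.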

How can I plot a grouped bar chart similar to the following format? The orange annotations are my own additions to show what I mean. Thank you.

enter image description here

Update on Jan. 20, 2020: Thanks to @StupidWolf's help, the following code works:

airbnb_host_count = airbnb_melted.replace(np.nan, 'NA').groupby(['value', 'variable']).count().reset_index()

enter image description here

After the groupby, the 'host_id' column holds the counts:

ggplot(airbnb_host_count, aes(x='variable', y='host_id', fill='value')) +\
    geom_bar(stat='sum', position=position_dodge())
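A variation on the same idea, as a minimal sketch: store the counts in an explicitly named column so that the y aesthetic does not reuse 'host_id', then draw the precomputed heights with geom_col. The `melted` frame, the column name `n`, and the helper `grouped_bar` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for airbnb_melted; the real data is not available here.
melted = pd.DataFrame({
    "host_id": range(6),
    "variable": ["host_identity_verified"] * 3 + ["host_is_superhost"] * 3,
    "value": ["t", "f", np.nan, "t", "t", np.nan],
})

# Count rows per (variable, value), keeping missing values as their own 'NA' category.
host_counts = (
    melted.fillna({"value": "NA"})
    .groupby(["variable", "value"])
    .size()
    .reset_index(name="n")  # 'n' is a column name chosen for this sketch
)
print(host_counts)


def grouped_bar(counts):
    """Return a dodged bar chart of precomputed counts (requires plotnine)."""
    from plotnine import ggplot, aes, geom_col

    return (
        ggplot(counts, aes(x="variable", y="n", fill="value"))
        + geom_col(position="dodge")
    )
```

Calling `grouped_bar(host_counts)` should give one cluster of bars per variable, with one bar per value of t, f, and NA where present.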

enter image description here

vcai01

1 Answer

Try this:

from plotnine import *
import pandas as pd
import numpy as np
import random

random.seed(99)
airbnb_melted = pd.DataFrame(
    {'host_id':np.arange(20),
     'variable': np.repeat(['host_identity_verified','host_is_superhost'],[10,10]) ,
     'value' : random.choices(['t','f','NA'],k=20)
    })

I do not have your dataframe, so check what the missing value actually is and replace it accordingly. For example, if it is NaN:

airbnb_melted = airbnb_melted.replace(np.nan, 'NA')
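One way to check what the missing value actually is, sketched on a made-up column (the `value` series here is an assumption, not the real data): count true nulls with isna(), count literal 'NA' strings separately, and only then replace. The sketch also shows that sorting a mix of strings and float NaN raises a TypeError, which is consistent with the "'<' not supported" error reported in the comments, though the exact cause in the real data is not confirmed.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing strings with true missing values (float NaN).
value = pd.Series(["t", "f", np.nan, "t", np.nan], name="value")

n_missing = int(value.isna().sum())        # true nulls
n_na_string = int((value == "NA").sum())   # literal 'NA' strings

# Sorting mixed str/float values fails; ordering categories needs one comparable type.
try:
    sorted(value.tolist())
    mixed_sort_ok = True
except TypeError:
    mixed_sort_ok = False

# After replacing the nulls with a string, every entry is a str and can be tallied.
filled = value.fillna("NA")
counts = filled.value_counts().to_dict()
print(n_missing, n_na_string, mixed_sort_ok, counts)
```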

We can check the counts:

airbnb_melted.groupby(['value','variable']).count()

                              host_id
value variable
NA    host_identity_verified        3
      host_is_superhost             2
f     host_identity_verified        3
      host_is_superhost             6
t     host_identity_verified        4
      host_is_superhost             2

Now we plot. Set fill='value' and leave 'stat' unset, because the default is 'count', which tallies your t, f and NA:

ggplot(airbnb_melted, aes(x='variable', fill='value')) +\
        geom_bar(position=position_dodge())

enter image description here

StupidWolf
  • Thank you for your help, StupidWolf, and sorry for the late reply. I think the ```NA``` value you simulated is a string rather than a null value. When I ran your code on my data, the null values could not be counted, and it raised ```TypeError: ("'<' not supported between instances of 'float' and 'str'", 'occurred at index fill')```. Is there an alternative? I would rather not delete the missing values if possible. Thank you. – vcai01 Jan 20 '20 at 21:51
  • I see. Let me try to replace the NA. But what is the actual value you have? Python has no NA, so I guess it is NaN? – StupidWolf Jan 20 '20 at 21:54
  • Hi StupidWolf, I think I figured it out: ```ggplot(airbnb_melted, aes(x='variable', y='host_id', fill='value')) + geom_bar(stat='sum', position=position_dodge())```. I'll show you the screenshot. – vcai01 Jan 20 '20 at 22:05
  • Do this: airbnb_melted.replace(np.nan,'NA').groupby(['value','variable']).count() – StupidWolf Jan 20 '20 at 22:05