I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# create and seed a randomstate object (to make #s repeatable below)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
mymean = rnd.uniform(low=2.8,high=3.2)
mysigma = rnd.uniform(low=0.6,high=1.0)
df = df.append(
pd.DataFrame({'Fuel': myfuel,
'MW': np.array(rnd.lognormal(mean=mymean,sigma=mysigma,size=1000))
}),
ignore_index=True
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5
)
And here's the plot of the estimated distributions of plant size by MW this code makes:
This violinplot is very deceptive, without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs--basically a weighted version of this first violinplot.
I wanted to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or if anyone has an idea about the most elegant way to do that.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seemed slightly brute force. Any better ideas for approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
df = df.append(
pd.DataFrame({'Fuel': myfuel,
# To make it easy to see that the violinplot of dfw (below)
# makes sense, here we'll just use a simple range list from
# 0 to 10
'MW': np.array(range(11))
}),
ignore_index=True
)
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# create another empty dataframe
dfw = pd.DataFrame(data=None,columns=['Fuel','MW'])
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
# too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0
# loop through rows of df
for index, row in df.iterrows():
# calculate and store the number of norm MW there are within the MW of each plant
mynum = int(round(row['MW']/norm))
# insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
dfw = dfw.append(
pd.DataFrame({'Fuel': row['Fuel'],
'MW': np.array([row['MW']]*mynum,dtype='float')
}),
ignore_index=True
)
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax1
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=dfw,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax2
)
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
item.set_fontsize(8)
item.set_rotation(30)
item.set_horizontalalignment('right')
plt.show()