22

Suppose I have a DataFrame created like this:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

There is quite a lot of sparsity in the strings in the real data. I would like to create histograms of how often each string occurs, laid out like the output of d.hist() (i.e. with subplots), one subplot each for s1 and s2.

Just doing d.hist() gives this error:

/Library/Python/2.7/site-packages/pandas/tools/plotting.pyc in hist_frame(data, column, by, grid, xlabelsize, xrot, ylabelsize, yrot, ax, sharex, sharey, **kwds)
   1725         ax.xaxis.set_visible(True)
   1726         ax.yaxis.set_visible(True)
-> 1727         ax.hist(data[col].dropna().values, **kwds)
   1728         ax.set_title(col)
   1729         ax.grid(grid)

/Library/Python/2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8099             # this will automatically overwrite bins,
   8100             # so that each histogram uses the same bins
-> 8101             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
   8102             if mlast is None:
   8103                 mlast = np.zeros(len(bins)-1, m.dtype)

/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    167             else:
    168                 range = (a.min(), a.max())
--> 169         mn, mx = [mi+0.0 for mi in range]
    170         if mn == mx:
    171             mn -= 0.5

TypeError: cannot concatenate 'str' and 'float' objects

I suppose I could manually go through each series, do a value_counts(), then plot it as a bar plot, and manually create the subplots. I wanted to check if there is a simpler way.

Brian Tompsett - 汤莱恩
amatsukawa
  • All the answers referring to value_counts are wrong, since the question is about generating a histogram and not just counting values. A histogram for a collection of strings is best captured as categorical and sortable data, with a min value, a max value, bins and a total ordering. – natbusa Jun 28 '20 at 13:59

3 Answers

33

Recreating the dataframe:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

To get the histogram with subplots as desired:

d.apply(pd.value_counts).plot(kind='bar', subplots=True)

The OP mentioned pd.value_counts in the question. I think the missing piece is just that there is no reason to "manually" create the desired bar plot.

The output from d.apply(pd.value_counts) is a pandas dataframe. We can plot the values like any other dataframe, and selecting the option subplots=True gives us what we want.
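For newer pandas versions, here is a minimal sketch of the same idea using the Series method instead of the top-level function (as far as I know, pd.value_counts is deprecated from pandas 2.1 onward). The Agg backend line is my own assumption so that the snippet runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# Count occurrences per column; the result is a DataFrame indexed by
# the distinct strings, with NaN where a string never appears in a column.
counts = d.apply(lambda col: col.value_counts())

# One bar chart per column; with subplots=True pandas returns an array of Axes.
axes = counts.plot(kind='bar', subplots=True)
```

Each element of axes is an ordinary matplotlib Axes, so titles and labels can be adjusted afterwards.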

Aman
  • This simply works! Any idea why matplotlib's hist fails to draw the same thing (it simply takes forever), so that we need `value_counts` and a bar chart like here? – matanster Aug 26 '18 at 15:58
11

You can use pd.value_counts (value_counts is also a series method):

In [20]: d.apply(pd.value_counts)
Out[20]: 
   s1  s2
a   3   3
b   2 NaN
c   1 NaN
d NaN   1
f NaN   3

and then plot the resulting DataFrame.
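The NaN entries in the table above mark strings that never occur in a column. If you would rather have explicit zeros (for example to get integer counts), a small sketch could look like this; filling with 0 is my own choice, not something the question asks for.

```python
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# value_counts drops missing values by default, so the padding NaN in s1
# (s1 is one element shorter than s2) is not counted.
counts = d.apply(lambda col: col.value_counts())

# Replace "string not present" with 0 and go back to integer counts.
counts = counts.fillna(0).astype(int)
print(counts)
```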

bmu
1

I would shove the Series into a collections.Counter (you might need to convert it to a list first). I am not a pandas expert, but I think you should be able to fold the Counter object back into a Series, indexed by the strings, and use that to make your plots.
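A rough sketch of that round trip, using one of the series from the question. The variable names are mine, and the sort_index call is only there to get a stable order for plotting.

```python
from collections import Counter

import pandas as pd

s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])

# Count the strings with the standard library.
counter = Counter(s2.tolist())

# Fold the counts back into a Series indexed by the strings.
counts = pd.Series(dict(counter)).sort_index()
print(counts)
```

From here counts.plot(kind='bar') should give the bar chart, assuming matplotlib is installed.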

d.hist() is not working because it (rightly) raises an error when it tries to guess where the bin edges should be, which simply makes no sense with strings.
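If you do want something bin-like, one option (my assumption, not what d.hist() does internally) is to map each distinct string to an integer code first, so that there is one well-defined bin per category:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])

# Categorical assigns every distinct string an integer code
# (categories are sorted, so 'a' -> 0, 'd' -> 1, 'f' -> 2).
cat = pd.Categorical(s2)

# With integer codes, counting per bin is well defined.
# Note: missing values would get code -1 and need dropping first.
hist = np.bincount(cat.codes, minlength=len(cat.categories))

result = dict(zip(cat.categories, hist))
print(result)
```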

tacaswell
  • ag, beat me to it! yes, counter is the tool for the job! – Andy Hayden Feb 21 '13 at 01:06
  • 1
    Thanks for the response. value_counts does the same thing, and is a Series -> Series transformation (so there is no need to force it back into a Series). I guess I was wondering if there was some option to do this counting and plotting for me automatically for this specific case of strings, because there is one for ints. – amatsukawa Feb 21 '13 at 01:31