Turn Pandas DataFrame of strings into histogram

Question

Suppose I have a DataFrame created like this:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

The strings in the real data are quite sparse. I would like to create histograms of string occurrences that look like what d.hist() generates (e.g. with subplots), one subplot each for s1 and s2.

Just doing d.hist() gives this error:

/Library/Python/2.7/site-packages/pandas/tools/plotting.pyc in hist_frame(data, column, by, grid, xlabelsize, xrot, ylabelsize, yrot, ax, sharex, sharey, **kwds)
   1725         ax.xaxis.set_visible(True)
   1726         ax.yaxis.set_visible(True)
-> 1727         ax.hist(data[col].dropna().values, **kwds)
   1728         ax.set_title(col)
   1729         ax.grid(grid)

/Library/Python/2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8099             # this will automatically overwrite bins,
   8100             # so that each histogram uses the same bins
-> 8101             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
   8102             if mlast is None:
   8103                 mlast = np.zeros(len(bins)-1, m.dtype)

/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    167             else:
    168                 range = (a.min(), a.max())
--> 169         mn, mx = [mi+0.0 for mi in range]
    170         if mn == mx:
    171             mn -= 0.5

TypeError: cannot concatenate 'str' and 'float' objects

I suppose I could manually go through each series, do a value_counts(), then plot it as a bar plot, and manually create the subplots. I wanted to check if there is a simpler way.

All the answers referring to value_counts are wrong, since the question is about generating a histogram and not just counting values. A histogram for a collection of strings is best captured as categorical and sortable data, with a min value, a max value, bins, and a total ordering. – natbusa, Jun 28 '20 at 13:59
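
One way to read the comment above is to treat the strings as an ordered categorical, so that each category acts as a bin and the bin order is fixed. The following is only a sketch under that assumption; the `vocabulary` list and the variable names are hypothetical and not part of the original question.

```python
import pandas as pd

s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])

# Hypothetical fixed vocabulary; the order of this list defines the bin order.
vocabulary = ['a', 'b', 'c', 'd', 'f']
as_categories = pd.Series(pd.Categorical(s2, categories=vocabulary, ordered=True))

# sort=False keeps the category order instead of sorting by frequency,
# and categories that never occur are reported with a count of 0.
counts = as_categories.value_counts(sort=False)
print(counts)
```

The resulting Series can be drawn with `counts.plot(kind='bar')`, which would show empty bins for strings that never occur.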

score 33 · Answer 1 · answered Feb 26 '14 at 21:35

Recreating the dataframe:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

To get the histogram with subplots as desired:

d.apply(pd.value_counts).plot(kind='bar', subplots=True)

The OP mentioned pd.value_counts in the question. I think the missing piece is just that there is no reason to "manually" create the desired bar plot.

The output from d.apply(pd.value_counts) is a pandas dataframe. We can plot the values like any other dataframe, and selecting the option subplots=True gives us what we want.
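
For reference, here is a self-contained sketch of the same idea. It assumes a recent pandas release, where the top-level `pd.value_counts` emits a deprecation warning, so it passes the `pd.Series.value_counts` method to `apply` instead. The `Agg` backend line is also an assumption, added only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: no display available)
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# Count each column separately; strings absent from a column become NaN.
counts = d.apply(pd.Series.value_counts)

# One bar chart per column, drawn as separate subplots.
axes = counts.plot(kind='bar', subplots=True)
```

With `subplots=True` the call returns one Axes object per column, so titles or labels can still be adjusted individually afterwards.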

This simply works! Any idea why matplotlib's hist fails to draw the same (it simply takes forever) rather than using `value_counts` and a bar chart like here? – matanster, Aug 26 '18 at 15:58

score 11 · Answer 2 · answered Feb 21 '13 at 06:13

You can use pd.value_counts (value_counts is also a series method):

In [20]: d.apply(pd.value_counts)
Out[20]: 
   s1  s2
a   3   3
b   2 NaN
c   1 NaN
d NaN   1
f NaN   3

and then plot the resulting DataFrame.
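
As a sketch of the Series-method variant, the counts can be built per column and combined into one table. Replacing the NaN entries with 0 is an assumption about what is wanted here (absent strings shown as zero counts), not something stated in this answer.

```python
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# value_counts as a Series method, applied to each column in turn.
per_column = {col: d[col].value_counts() for col in d.columns}

# Combine on the union of strings; missing combinations become 0 instead of NaN.
counts = pd.DataFrame(per_column).fillna(0).astype(int)
print(counts)
```

The integer table can then be plotted like any other DataFrame, for example with `counts.plot(kind='bar', subplots=True)`.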

bmu

score 1 · Answer 3 · answered Feb 21 '13 at 00:50

I would shove the Series into a collections.Counter (you might need to convert it to a list first). I am not a pandas expert, but I think you should be able to fold the Counter object back into a Series, indexed by the strings, and use that to make your plots.

d.hist() is not working because it (rightly) raises an error when it tries to guess where the bin edges should be, which simply makes no sense with strings.
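
The Counter approach described in this answer might look like the following minimal sketch. The variable names are illustrative, and dropping missing values before counting is an assumption.

```python
from collections import Counter

import pandas as pd

s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])

# Convert to a plain list first, dropping missing values (assumption).
tally = Counter(s2.dropna().tolist())

# A Counter is a dict subclass, so it folds straight back into a Series
# indexed by the strings.
counts = pd.Series(tally).sort_index()
print(counts)
```

The resulting Series can be drawn with `counts.plot(kind='bar')`, which sidesteps the bin-edge guessing entirely.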

tacaswell

ag, beat me to it! yes, counter is the tool for the job! – Andy Hayden Feb 21 '13 at 01:06

Thanks for the response. value_counts does the same thing, and is a Series -> Series transformation (so there is no need to force it back into a Series). I guess I was wondering if there was some option to do this counting and plotting for me automatically for this specific case of strings, because there is one for ints. – amatsukawa Feb 21 '13 at 01:31
