Pandas - split columns and include counts

Question

I have the following dataframe:

            doc_id  is_fulltext
1243      dok:1            1
3310      dok:1            1
4370      dok:1            1
14403  dok:1020            1
17252  dok:1020            1
15977  dok:1020            0
16480  dok:1020            1
16252  dok:1020            1
468     dok:103            1
128    dok:1030            0
1673   dok:1038            1

I would like to split the is_fulltext column into two columns and count the occurrences of the docs at the same time.

Desired Output:

     doc_id  fulltext  non-fulltext
0     dok:1         3             0
1  dok:1020         4             1
2   dok:103         1             0
3  dok:1030         0             1
4  dok:1038         1             0

I followed the approach from "Pandas - Create columns from column value, and fill with count".

That post shows several alternatives, suggesting Categorical or reindex. I tried the following:

cats = ['fulltext', 'non_fulltext']
df_sorted['is_fulltext'] = pd.Categorical(df_sorted['is_fulltext'], categories=cats)
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0)

Here I get a ValueError:

ValueError: Length of passed values is 17446, index implies 0

Then I tried this method:

cats = ['fulltext', 'non_fulltext']
new_df = df_sorted.groupby(['doc_id','is_fulltext']).size().unstack(fill_value=0).reindex(columns=cats).reset_index()

While this seems to have worked fine in the original post, my counts are filled with NaNs (see below). I have since read that this can happen when combining reindex with a Categorical, but I wonder why it seems to have worked in the original post. How can I solve this? Can anyone help? Thank you!

     doc_id  fulltext  non-fulltext
0     dok:1       NaN           NaN
1  dok:1020       NaN           NaN
2   dok:103       NaN           NaN
3  dok:1030       NaN           NaN
4  dok:1038       NaN           NaN
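
For context on the first attempt, here is a minimal sketch of what pd.Categorical does when none of the values appear in the categories list. The `sample` frame and the `kind` column are made up for illustration. Since 0 and 1 are not among the labels 'fulltext' and 'non_fulltext', every entry turns into NaN before the groupby even runs (newer pandas releases may also warn about this). Mapping the flags to labels first is one possible way around it; that is an assumption about the intent, not the linked post's method.

```python
import pandas as pd

# Hypothetical miniature of the question's data
sample = pd.DataFrame({
    "doc_id": ["dok:1", "dok:1", "dok:1020", "dok:1020"],
    "is_fulltext": [1, 1, 0, 1],
})

cats = ["fulltext", "non_fulltext"]

# 0 and 1 are not in `cats`, so every value is converted to NaN
as_cat = pd.Categorical(sample["is_fulltext"], categories=cats)
all_missing = bool(pd.isna(as_cat).all())
print(all_missing)

# Map the integer flags to the labels first, then make them categorical
labels = sample["is_fulltext"].map({1: "fulltext", 0: "non_fulltext"})
sample["kind"] = pd.Categorical(labels, categories=cats)

# observed=False keeps categories that never occur in a group as zero counts
counts = (sample.groupby(["doc_id", "kind"], observed=False)
                .size()
                .unstack(fill_value=0))
print(counts)
```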

yatu · Accepted Answer · 2019-03-28T09:26:52.527

You could GroupBy the doc_id, apply pd.value_counts to each group and unstack:

(df.groupby('doc_id').is_fulltext.apply(pd.value_counts)
                                 .unstack()
                                 .fillna(0)
                                 .rename(columns={0:'non-fulltext', 1:'fulltext'})
                                 .reset_index())

     doc_id  non-fulltext  fulltext
0     dok:1           0.0       3.0
1  dok:1020           1.0       4.0
2   dok:103           0.0       1.0
3  dok:1030           1.0       0.0
4  dok:1038           0.0       1.0
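
A variant of the same idea, sketched under the assumption that a reasonably recent pandas is installed: calling value_counts directly on the grouped column avoids the top-level pd.value_counts (which, as far as I know, newer releases flag as deprecated), and unstack(fill_value=0) keeps the counts as integers instead of floats. The `sample` frame is a made-up subset of the question's data.

```python
import pandas as pd

# Hypothetical subset of the question's data
sample = pd.DataFrame({
    "doc_id": ["dok:1", "dok:1", "dok:1", "dok:1020", "dok:1020", "dok:1030"],
    "is_fulltext": [1, 1, 1, 1, 0, 0],
})

result = (sample.groupby("doc_id")["is_fulltext"]
                .value_counts()               # counts per (doc_id, is_fulltext)
                .unstack(fill_value=0)        # missing combinations become 0
                .rename(columns={0: "non-fulltext", 1: "fulltext"})
                .reset_index())
print(result)
```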

Or, similar to your own method, if performance is an issue, use this instead:

(df.groupby(['doc_id','is_fulltext']).size()
                                     .unstack(fill_value=0)
                                     .rename(columns={0:'non_fulltext', 1:'fulltext'})
                                     .reset_index())

is_fulltext    doc_id  non_fulltext  fulltext
0               dok:1             0         3
1            dok:1020             1         4
2             dok:103             0         1
3            dok:1030             1         0
4            dok:1038             0         1

Thanks, this works perfectly. There is one tiny little problem though: it is rather slow especially if you have a large set of data. Any way to solve this? — Py_Bear, Mar 28 '19 at 08:31
@Annika I adapted your own approach. Using reindex was setting all values to NaN; what you wanted was rename. This should be much faster, as it avoids any Python-level looping — yatu, Mar 28 '19 at 09:27
Thank you so much! Amazing what difference one little word makes (reindex vs. rename)... — Py_Bear, Mar 28 '19 at 10:12
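
To make the reindex vs. rename difference concrete, here is a small sketch on a made-up `sample` frame. After unstack, the column labels are the integers 0 and 1. reindex(columns=cats) looks for columns literally named 'fulltext' and 'non_fulltext', finds none, and returns new columns filled with NaN, whereas rename only relabels the columns that already exist. In the linked post the values were presumably already strings matching the category names, which would explain why reindex worked there; that part is an inference, not something stated in this thread.

```python
import pandas as pd

# Hypothetical miniature of the question's data
sample = pd.DataFrame({
    "doc_id": ["dok:1", "dok:1", "dok:1020", "dok:1020"],
    "is_fulltext": [1, 1, 0, 1],
})

counts = sample.groupby(["doc_id", "is_fulltext"]).size().unstack(fill_value=0)
print(list(counts.columns))   # the column labels are the integers 0 and 1

# reindex selects columns by label; these labels do not exist, so all NaN
cats = ["fulltext", "non_fulltext"]
reindexed = counts.reindex(columns=cats)
print(reindexed.isna().all().all())

# rename keeps the data and only changes the labels
renamed = counts.rename(columns={0: "non_fulltext", 1: "fulltext"})
print(renamed)
```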

score 0 · Answer 2 · answered Mar 27 '19 at 11:04

I don't know if it's the best approach, but this should work for you:

import pandas as pd
df = pd.DataFrame({"doc_id":["id1", "id2", "id1", "id2"], 
                   "is_fulltext":[1, 0, 1, 1]})
df_grouped = df.groupby("doc_id").sum().reset_index()
df_grouped["non_fulltext"] = df.groupby("doc_id").count().reset_index()["is_fulltext"] - df_grouped["is_fulltext"]
df_grouped

And the output is:

  doc_id  is_fulltext  non_fulltext
0    id1            2             0
1    id2            1             1

Thanks, this works great - unless you have a large set of data. It is fast; however, in the last rows the value for non_fulltext becomes NaN again... — Py_Bear, Mar 28 '19 at 08:33
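
The thread does not say what caused the NaN values mentioned in this comment. One sketch that avoids depending on positional alignment after reset_index is to keep both aggregates indexed by doc_id and subtract them before resetting the index. It assumes is_fulltext only contains 0 and 1; the frame is the answer's own example, and the names `fulltext`, `total` and `df_counts` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"doc_id": ["id1", "id2", "id1", "id2"],
                   "is_fulltext": [1, 0, 1, 1]})

grouped = df.groupby("doc_id")["is_fulltext"]
fulltext = grouped.sum()            # number of rows with is_fulltext == 1
total = grouped.size()              # number of rows per doc_id

# Both Series are indexed by doc_id, so the subtraction aligns on doc_id
df_counts = pd.DataFrame({"fulltext": fulltext,
                          "non_fulltext": total - fulltext}).reset_index()
print(df_counts)
```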
