I have the following dataframe:
         doc_id  is_fulltext
1243      dok:1            1
3310      dok:1            1
4370      dok:1            1
14403  dok:1020            1
17252  dok:1020            1
15977  dok:1020            0
16480  dok:1020            1
16252  dok:1020            1
468     dok:103            1
128    dok:1030            0
1673   dok:1038            1
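For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is only a sketch of the excerpt shown here: the variable name df_sorted matches my code below, and I am assuming is_fulltext is stored as integers 0/1, as it prints.

```python
import pandas as pd

# Rebuild the excerpt shown above; the real frame has many more rows.
df_sorted = pd.DataFrame(
    {
        "doc_id": [
            "dok:1", "dok:1", "dok:1",
            "dok:1020", "dok:1020", "dok:1020", "dok:1020", "dok:1020",
            "dok:103", "dok:1030", "dok:1038",
        ],
        # Assumption: the flag is an integer column (1 = fulltext, 0 = not).
        "is_fulltext": [1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    },
    index=[1243, 3310, 4370, 14403, 17252, 15977, 16480, 16252, 468, 128, 1673],
)

print(df_sorted)
```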
I would like to split the is_fulltext column into two columns, one per value, and at the same time count how many rows each doc_id has for each value.
Desired Output:
     doc_id  fulltext  non-fulltext
0     dok:1         3             0
1  dok:1020         4             1
2   dok:103         1             0
3  dok:1030         0             1
4  dok:1038         1             0
I followed the approach from the post "Pandas - Create columns from column value, and fill with count".
That post shows several alternatives and suggests using either Categorical or reindex. First I tried the following:
cats = ['fulltext', 'non_fulltext']
df_sorted['is_fulltext'] = pd.Categorical(df_sorted['is_fulltext'], categories=cats)
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0)
Here I get a ValueError:
ValueError: Length of passed values is 17446, index implies 0
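One thing that may be relevant here: my is_fulltext column holds 0 and 1, while cats holds the strings 'fulltext' and 'non_fulltext'. As far as I understand, pd.Categorical turns any value that is not among the given categories into NaN. The small check below (the values list is made up for illustration) suggests that every entry ends up missing, which might be related to the error, although I have not confirmed that it is the cause.

```python
import pandas as pd

# Hypothetical stand-in for the is_fulltext column (integers 0/1).
values = pd.Series([1, 1, 0, 1])
cats = ["fulltext", "non_fulltext"]

# Values that are not listed in `categories` become NaN (code -1).
as_cat = pd.Categorical(values, categories=cats)

print(as_cat)
print(as_cat.isna().all())  # every entry is missing
```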
Then I tried this method:
cats = ['fulltext', 'non_fulltext']
new_df = df_sorted.groupby(['doc_id','is_fulltext']).size().unstack(fill_value=0).reindex(columns=cats).reset_index()
While this seems to have worked fine in the original post, my counts are all NaN (see below). I have since read that this can happen when reindex is combined with a Categorical, but I don't understand why it worked in the original post. How can I solve this? Can anyone help? Thank you!
     doc_id  fulltext  non-fulltext
0     dok:1       NaN           NaN
1  dok:1020       NaN           NaN
2   dok:103       NaN           NaN
3  dok:1030       NaN           NaN
4  dok:1038       NaN           NaN
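A possible direction, sketched under the assumption that is_fulltext really contains the integers 0 and 1: reindex(columns=cats) looks for columns named 'fulltext' and 'non_fulltext', but after unstack the columns would be named 0 and 1, so nothing matches and everything becomes NaN. Mapping the values to the desired labels first seems to give the counts I am after on the sample data. The labels dict and the helper column name kind are my own placeholders, not something from the original post.

```python
import pandas as pd

# Sample rows from the question (index omitted, it is not needed here).
df_sorted = pd.DataFrame(
    {
        "doc_id": [
            "dok:1", "dok:1", "dok:1",
            "dok:1020", "dok:1020", "dok:1020", "dok:1020", "dok:1020",
            "dok:103", "dok:1030", "dok:1038",
        ],
        "is_fulltext": [1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    }
)

# Assumed mapping from the 0/1 flag to the desired column names.
labels = {1: "fulltext", 0: "non_fulltext"}
cats = ["fulltext", "non_fulltext"]

new_df = (
    df_sorted.assign(kind=df_sorted["is_fulltext"].map(labels))
    .groupby(["doc_id", "kind"])
    .size()                                # rows per (doc_id, kind) pair
    .unstack(fill_value=0)                 # one column per label
    .reindex(columns=cats, fill_value=0)   # keep both columns even if one never occurs
    .reset_index()
)
new_df.columns.name = None  # drop the leftover "kind" label on the column axis

print(new_df)
```

The reindex step is only there as a safeguard so that both columns exist even when one of the labels does not occur in the data at all.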