1

I have a df like this,

Owner   Messages
AAA     (YY) Duplicates
AAA     Missing Number; (VV) Corrected Value; (YY) Duplicates
AAA     (YY) Duplicates
BBB     (YY) Duplicates
BBB     Missing Measure; Missing Number

When I do a normal groupby like this,

df_grouped = df.groupby([' Owner', 'Messages']).size().reset_index(name='count')
df_grouped

I get this as expected,

    Owner  Messages                                               count
0   AAA   (YY) Duplicates                                           2
1   AAA   Missing Number; (VV) Corrected Value; (YY) Duplicates     1
2   BBB   (YY) Duplicates                                           1
3   BBB   Missing Measure; Missing Number                           1

However, I need something (desired output) like this splitting by ; inside Messages column.

   Owner    Messages             count
0   AAA    (YY) Duplicates       3
1   AAA    Missing Number        1
2   AAA    (VV) Corrected Value  1
3   BBB    (YY) Duplicates       1
4   BBB    Missing Measure       1
5   BBB    Missing Number        1

So far, based on this post, @LeoRochael's answer, it splits Messages column's values by ; and puts into a list. Anyhow, I can not get the individual count after splitting.

Any ideas how to get my desired output?

i.n.n.m
  • 2,936
  • 7
  • 27
  • 51

2 Answers2

6

You need to unnest your original dataframe , then we just do group size

s=df.set_index('Owner').Messages.str.split('; ',expand=True).stack().to_frame('Messages').reset_index()
s.groupby(['Owner','Messages']).size()
Out[1213]: 
Owner  Messages            
AAA    (VV) Corrected Value    1
       (YY) Duplicates         3
       Missing Number          1
BBB    (YY) Duplicates         1
       Missing Measure         1
       Missing Number          1
dtype: int64
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I am still getting `AAA`'s `(YY) Duplicates` 2 and then another `(YY) Duplicates` 1 of them. You seem to get 3. I am not sure what I am missing. – i.n.n.m Apr 19 '18 at 14:33
  • @i.n.n.m ('; ') have a blank after ; – BENY Apr 19 '18 at 14:34
  • Is `expand = True` necessary here?... I got there by `df.loc[:, 'Messages'] = df.loc[:, 'Messages'].str.split('; '); df.set_index(['Owner'])['Messages'].apply(pd.Series).stack().reset_index()....` – Sotos Apr 19 '18 at 14:39
  • t@Sotos If adding `apply(pd.Series)` we do not need expand =True – BENY Apr 19 '18 at 14:40
  • @Wen I have a space too right after `;`. I tried @piRSquared's answer still getting the same reult, I don't get count 3 for `(YY) Duplicates`. I know I might be missing something simple, but can't seem to find it! – i.n.n.m Apr 19 '18 at 14:45
  • @i.n.n.m adding this line , `s.Messages=s.Messages.str.strip()` before groupby – BENY Apr 19 '18 at 14:46
  • 1
    @i.n.n.m yw :-) happy coding – BENY Apr 19 '18 at 14:49
2
from collections import Counter
import pandas as pd

pd.Series(
    Counter([(o, m) for o, M in df.values for m in M.split('; ')])
).rename_axis(['Owner', 'Message']).reset_index(name='Count')

  Owner               Message  Count
0   AAA  (VV) Corrected Value      1
1   AAA       (YY) Duplicates      3
2   AAA        Missing Number      1
3   BBB       (YY) Duplicates      1
4   BBB       Missing Measure      1
5   BBB        Missing Number      1
piRSquared
  • 285,575
  • 57
  • 475
  • 624