2

I have three columns. the table looks like this:

ID.   names     tag
1.     john.     1
2.     sam       0
3.    sam,robin. 1
4.     robin.    1

Id: type integer Names: type string Tag: type integer (just 0,1)

What I want is to find how many times each name is repeated grouped by 0 and 1. this is to be done in python.

Answer must look like

               0                 1
John           23                12
Robin          32                10
sam            9                 30
user3483203
  • 50,081
  • 9
  • 65
  • 94
Kathan Vyas
  • 355
  • 3
  • 16
  • Are those names separated by commas in a list `['sam', 'robin.']` or is it the string `'sam,robin.'`? – ALollz Oct 16 '18 at 19:49
  • I cannot understand the expected output. How count for John is 23? and what is 12? – mad_ Oct 16 '18 at 19:51
  • It's just an example assuming that for a given full input (not 4 rows) john would happen to pop up 23 times with tag=1 and 12 times with tag=0 – Thunderwood Oct 16 '18 at 19:52
  • @ALollz well those are strings separated by semicolon (;) – Kathan Vyas Oct 16 '18 at 19:57
  • @mad_ yes user-thunderwood is right. it's just an example. we need how many times those names are getting repeated for each of the two tags. – Kathan Vyas Oct 16 '18 at 19:58
  • 1
    Then you need to provide a [mcve] that represents your data. Otherwise, you're going to get a solution to the problem you posted, not the problem you're having. If there's a semicolon between the values, then there should be a semicolon in your example. – ALollz Oct 16 '18 at 20:06
  • 1
    @ALollz sorry for teh confusion. I ahve edited my question. – Kathan Vyas Oct 16 '18 at 20:13

2 Answers2

4

Using extractall and crosstab:

s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)

pd.crosstab(s[0], s['tag'])

tag    0  1
0
john   0  1
robin  0  2
sam    1  1
user3483203
  • 50,081
  • 9
  • 65
  • 94
  • 1
    yup this works. I mistyped it. Thank you so much. This is a much concise answer. – Kathan Vyas Oct 16 '18 at 20:24
  • This separates everything including ones separated by slashes and space. CAn we limit the separating entity to just the semicolon (;)? – Kathan Vyas Oct 16 '18 at 20:38
  • It's not clear how you know how to group `John A/John Adams` with `John A`. That makes this a *much* more complicated problem. I rolled back your question because you shouldn't edit it after you already have answers, but you should ask a new question with your new requirements. – user3483203 Oct 16 '18 at 20:44
  • Ohk. I did not know this. I am bit new here. So the changed question also works for me. under that scenario, and keeping your answer in mind, is there any way where i can control teh separation to just " ; " and no other things? – Kathan Vyas Oct 16 '18 at 20:50
  • 1
    You could create `s` using something like `s = df.names.str.extractall(r'([^;.]+)')[0].str.strip().reset_index(1, drop=True).to_frame().join(df.tag)` – user3483203 Oct 16 '18 at 20:51
  • But remember, this won't understand what `John A`, and `John A/John Adams` are referring to the same value. – user3483203 Oct 16 '18 at 20:52
  • 1
    Thank you.This solved a major part of my problem. I can do it form there. Thanks a lot. – Kathan Vyas Oct 16 '18 at 21:28
2

Because of the nature of your names column, there is some re-processing that needs to be done before you can get value counts. In the case of your example dataframe, this could look something like:

my_counts = (df.set_index(['ID.', 'tag'])
             # Get rid of periods and split on commas
             .names.str.strip('.').str.split(',')
             .apply(pd.Series)
             .stack()
             .reset_index([0, 1])
             # rename column 0 for consistency, easier reading
             .rename(columns={0: 'names'})
             # Get value counts of names per tag:
             .groupby('tag')['names']
             .value_counts()
             .unstack('tag', fill_value=0))

>>> my_counts
tag    0  1
names      
john   0  1
robin  0  2
sam    1  1
sacuL
  • 49,704
  • 8
  • 81
  • 106
  • pandas data frames does not have a names attribute. So I could not verify your solution – Kathan Vyas Oct 16 '18 at 20:03
  • Also some names are like John A/John adams ; Sam John ; robin john thus if we remove everything wont it give me wrong asnwer – Kathan Vyas Oct 16 '18 at 20:03
  • 1
    for your first comment, what is the output of `df.columns`? Because your dataframe clearly appears to have a `names` column. And for your second, you'll have to provide an example which actually represents your cases. The code above provides the solution I posted for the dataframe you supplied. We cannot infer the different kind of structure you have in your `names` column. – sacuL Oct 16 '18 at 20:05
  • oh Sorry my bad. My dataframe contains "names" as a column name but when we write df.names , it shows names is not an attribute of dataframe. Also when I write df.columns, it shows me list of column names which is ['id', 'names','tag'] – Kathan Vyas Oct 16 '18 at 20:11
  • I got an error: " 'DataFrameGroupBy' object has no attribute 'value_counts'" – Kathan Vyas Oct 16 '18 at 20:23