counting all string values in given column of a table and grouping it based on third column

Question

I have three columns. the table looks like this:

ID.   names     tag
1.     john.     1
2.     sam       0
3.    sam,robin. 1
4.     robin.    1

Id: type integer Names: type string Tag: type integer (just 0,1)

What I want is to find how many times each name is repeated grouped by 0 and 1. this is to be done in python.

Answer must look like

               0                 1
John           23                12
Robin          32                10
sam            9                 30

Are those names separated by commas in a list `['sam', 'robin.']` or is it the string `'sam,robin.'`? — ALollz, Oct 16 '18 at 19:49
I cannot understand the expected output. How count for John is 23? and what is 12? — mad_, Oct 16 '18 at 19:51
It's just an example assuming that for a given full input (not 4 rows) john would happen to pop up 23 times with tag=1 and 12 times with tag=0 — Thunderwood, Oct 16 '18 at 19:52
@mad_ yes user-thunderwood is right. it's just an example. we need how many times those names are getting repeated for each of the two tags. — Kathan Vyas, Oct 16 '18 at 19:58
Then you need to provide a [mcve] that represents your data. Otherwise, you're going to get a solution to the problem you posted, not the problem you're having. If there's a semicolon between the values, then there should be a semicolon in your example. — ALollz, Oct 16 '18 at 20:06

score 4 · Accepted Answer · answered Oct 16 '18 at 20:14

4

Using extractall and crosstab:

s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)

pd.crosstab(s[0], s['tag'])

tag    0  1
0
john   0  1
robin  0  2
sam    1  1

answered Oct 16 '18 at 20:14

user3483203

50,081
9
65
94

1

yup this works. I mistyped it. Thank you so much. This is a much concise answer. – Kathan Vyas Oct 16 '18 at 20:24
This separates everything including ones separated by slashes and space. CAn we limit the separating entity to just the semicolon (;)? – Kathan Vyas Oct 16 '18 at 20:38
It's not clear how you know how to group `John A/John Adams` with `John A`. That makes this a *much* more complicated problem. I rolled back your question because you shouldn't edit it after you already have answers, but you should ask a new question with your new requirements. – user3483203 Oct 16 '18 at 20:44
Ohk. I did not know this. I am bit new here. So the changed question also works for me. under that scenario, and keeping your answer in mind, is there any way where i can control teh separation to just " ; " and no other things? – Kathan Vyas Oct 16 '18 at 20:50
1

You could create `s` using something like `s = df.names.str.extractall(r'([^;.]+)')[0].str.strip().reset_index(1, drop=True).to_frame().join(df.tag)` – user3483203 Oct 16 '18 at 20:51
But remember, this won't understand what `John A`, and `John A/John Adams` are referring to the same value. – user3483203 Oct 16 '18 at 20:52
1

Thank you.This solved a major part of my problem. I can do it form there. Thanks a lot. – Kathan Vyas Oct 16 '18 at 21:28

score 2 · Answer 2 · answered Oct 16 '18 at 19:57

2

Because of the nature of your names column, there is some re-processing that needs to be done before you can get value counts. In the case of your example dataframe, this could look something like:

my_counts = (df.set_index(['ID.', 'tag'])
             # Get rid of periods and split on commas
             .names.str.strip('.').str.split(',')
             .apply(pd.Series)
             .stack()
             .reset_index([0, 1])
             # rename column 0 for consistency, easier reading
             .rename(columns={0: 'names'})
             # Get value counts of names per tag:
             .groupby('tag')['names']
             .value_counts()
             .unstack('tag', fill_value=0))

>>> my_counts
tag    0  1
names      
john   0  1
robin  0  2
sam    1  1

answered Oct 16 '18 at 19:57

sacuL

49,704
8
81
106

pandas data frames does not have a names attribute. So I could not verify your solution – Kathan Vyas Oct 16 '18 at 20:03
Also some names are like John A/John adams ; Sam John ; robin john thus if we remove everything wont it give me wrong asnwer – Kathan Vyas Oct 16 '18 at 20:03
1

for your first comment, what is the output of `df.columns`? Because your dataframe clearly appears to have a `names` column. And for your second, you'll have to provide an example which actually represents your cases. The code above provides the solution I posted for the dataframe you supplied. We cannot infer the different kind of structure you have in your `names` column. – sacuL Oct 16 '18 at 20:05
oh Sorry my bad. My dataframe contains "names" as a column name but when we write df.names , it shows names is not an attribute of dataframe. Also when I write df.columns, it shows me list of column names which is ['id', 'names','tag'] – Kathan Vyas Oct 16 '18 at 20:11
I got an error: " 'DataFrameGroupBy' object has no attribute 'value_counts'" – Kathan Vyas Oct 16 '18 at 20:23

counting all string values in given column of a table and grouping it based on third column

2 Answers2