-1

I have a dataset with 3 columns and over 300,000 rows. The first two column shows the symptoms a patient is experiencing and the last column shows the vaccine they were given. I want a dataset which will count the number of combinations of symptoms for Each Vaccine. I am attaching a sample Dataset and what I want below:

Exp=pd.DataFrame({'sym1':['pain','fever','headache','pain','paralysis'],\
                  'sym2':['fever','Cancer','paralysis','fever','pain'],'Vaccine':['a','b','c','a','d']})

it looks like this:

    sym1    sym2       Vaccine
0   pain    fever         a
1   fever   Cancer        b
2   headache paralysis    c
3   pain    fever         a
4   paralysis pain        d

What I want is this:

Result=pd.DataFrame({'sym1':['pain','fever','headache','paralysis'],\
                  'sym2':['fever','Cancer','paralysis','pain'],'Vaccine':['a','b','c','d'],'Count':[2,1,1,1,]})

and it looks like this:

sym1         sym2       Vaccine   Count
0   pain     fever        a        2
1   fever    Cancer       b        1
2   headache paralysis    c        1
3   paralysis   pain      d        1

2 Answers2

1

Lots of ways to do this:

With a groupby, transform and count and drop duplicate:

Exp["Count"] = Exp.groupby(["sym1", "sym2", "Vaccine"])["sym1"].transform("count")
# Can add a reset_index if needed
Exp = Exp.drop_duplicates()

or a classic group by and agg:

Exp = Exp.groupby(["sym1", "sym2", "Vaccine"], as_index=False).agg(Count=("sym1", "count"))

Cameron's got a really clever way though.

Simon
  • 1,201
  • 9
  • 18
0

If you have pandas version > 1.1.0 you can do:

counts_df = (Exp.value_counts()
             .rename("counts")
             .reset_index())

print(counts_df)
        sym1       sym2 Vaccine  counts
0       pain      fever       a       2
1  paralysis       pain       d       1
2   headache  paralysis       c       1
3      fever     Cancer       b       1
Cameron Riddell
  • 10,942
  • 9
  • 19