Row-wise conditional counting keeping all columns without iterating over dataframe

Question

I'm struggling with conditional counting in pandas.

Problem

I have a pandas dataframe that has 4 columns (for the sake of this example) : "id", "id2", "col1" and "type". The type column can have 3 values, namely "A", "B" and "C". What I'd like to do is, for each row, count the number of type C with the same id and id2. Here is a sample dataframe:

    id   id2 col1 type
0   "e"  "z"   0   "A"
1   "e"  "z"   1   "C"
2   "e"  "z"   2   "C"
3   "e"  "y"   3   "C"
4   "e"  "y"   4   "A"  
5   "f"  "y"   4   "A"
6   "f"  "x"   3   "B"
7   "f"  "x"   4   "B"
8   "g"  "w"   5   "C"
9   "g"  "w"   6   "B"

The code to build the sample dataframe:

pd.DataFrame({
    "id":   ["e", "e", "e", "e", "e", "f", "f", "f", "g", "g"],
    "id2":  ["z", "z", "z", "y", "y", "x", "x", "x", "w", "w"],
    "col1": [ 0 ,  1 ,  2 ,  3 ,  4 ,  4 ,  3 ,  4 ,  5 ,  6 ],
    "type": ["A", "C", "C", "C", "A", "A", "B", "B", "C", "B"]
})

And the desired result :

    id   id2 col1 type  count
0   "e"  "z"   0   "A"    2
1   "e"  "z"   1   "C"    2
2   "e"  "z"   2   "C"    2 
3   "e"  "y"   3   "C"    1
4   "e"  "y"   4   "A"    1
5   "f"  "y"   4   "A"    0
6   "f"  "x"   3   "B"    0
7   "f"  "x"   4   "B"    0
8   "g"  "w"   5   "C"    1
9   "g"  "w"   6   "B"    1

I don't really care about what happens to row with type "C" (eg. row 1, 2, 3, 8) so that's not a problem if they don't appear in the resulting dataframe.

I'd like a solution that doesn't rely on iterating "myself" through the dataset (no apply nor for loop) as they are too slow. I'm hopping to find a "pandaic" way of solving the problem.

Note: in the "real" dataset there are 3 columns used to index, type can have 5 different values and 36 data column should be preserved. But I prefer a scalable solution, not bounded to those number.

What I've tried

I can solve the problem using sqlalchemy and a query. Indeed, results should match the following query :

SELECT a.*, (SELECT COUNT(*)
             FROM df b
             WHERE 
                 b.id  = a.id  AND
                 b.id2 = a.id2 AND
                 b.type = "C")
FROM df a

The initial problem can also be reworded as "what's the python code equivalent to this query ?".

I can also solve the problem using apply. Both are very slow due to the size of the dataset, although sql method is probably slow because it has to build the database at first.

answer = df.groupby(['id','id2']).transform(sum)['type'].str.count('C')
pd.concat([df,answer], axis=1)

  id id2  col1  type  type
0  e   z     0     A     2
1  e   z     1     C     2
2  e   z     2     C     2
3  e   y     3     C     1
4  e   y     4     A     1
5  f   x     4     A     0
6  f   x     3     B     0
7  f   x     4     B     0
8  g   w     5     C     1
9  g   w     6     B     1

You can increase the columns in the groupby to whichever/how many you wish.

Thank you, this works perfectly. Concatenating the columns to count what I'm looking for is a nice trick I'll keep in mind! — Sacrefeu, Sep 26 '20 at 23:42

Row-wise conditional counting keeping all columns without iterating over dataframe

Problem

What I've tried

Related posts

1 Answers1