I'm struggling with conditional counting in pandas.
Problem
I have a pandas DataFrame with 4 columns (for the sake of this example): "id", "id2", "col1" and "type". The type column can take 3 values, namely "A", "B" and "C". What I'd like to do is, for each row, count the number of rows of type "C" that share the same id and id2. Here is a sample dataframe:
id id2 col1 type
0 "e" "z" 0 "A"
1 "e" "z" 1 "C"
2 "e" "z" 2 "C"
3 "e" "y" 3 "C"
4 "e" "y" 4 "A"
5 "f" "y" 4 "A"
6 "f" "x" 3 "B"
7 "f" "x" 4 "B"
8 "g" "w" 5 "C"
9 "g" "w" 6 "B"
The code to build the sample dataframe:
import pandas as pd

df = pd.DataFrame({
    "id":   ["e", "e", "e", "e", "e", "f", "f", "f", "g", "g"],
    "id2":  ["z", "z", "z", "y", "y", "y", "x", "x", "w", "w"],
    "col1": [ 0,   1,   2,   3,   4,   4,   3,   4,   5,   6 ],
    "type": ["A", "C", "C", "C", "A", "A", "B", "B", "C", "B"]
})
And the desired result:
id id2 col1 type count
0 "e" "z" 0 "A" 2
1 "e" "z" 1 "C" 2
2 "e" "z" 2 "C" 2
3 "e" "y" 3 "C" 1
4 "e" "y" 4 "A" 1
5 "f" "y" 4 "A" 0
6 "f" "x" 3 "B" 0
7 "f" "x" 4 "B" 0
8 "g" "w" 5 "C" 1
9 "g" "w" 6 "B" 1
I don't really care about what happens to rows with type "C" (e.g. rows 1, 2, 3 and 8), so it's not a problem if they don't appear in the resulting dataframe.
I'd like a solution that doesn't rely on iterating through the dataset myself (no apply, no for loop), as those are too slow. I'm hoping to find an idiomatic, vectorized pandas way of solving the problem.
Note: in the "real" dataset there are 3 columns used as the index, type can take 5 different values, and there are 36 data columns that should be preserved. But I'd prefer a scalable solution, not bound to those numbers.
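To be concrete about the scalability requirement, here is a sketch of the shape of solution I'm hoping exists, written against the sample data above (`keys` is just a stand-in for my real 3 index columns; I haven't been able to validate this on the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   ["e", "e", "e", "e", "e", "f", "f", "f", "g", "g"],
    "id2":  ["z", "z", "z", "y", "y", "y", "x", "x", "w", "w"],
    "col1": [0, 1, 2, 3, 4, 4, 3, 4, 5, 6],
    "type": ["A", "C", "C", "C", "A", "A", "B", "B", "C", "B"],
})

keys = ["id", "id2"]  # in the real dataset this would be 3 columns

# Vectorized idea: turn the condition into a boolean column, then sum it
# per group and broadcast the result back onto every row with transform.
df["count"] = df["type"].eq("C").groupby([df[k] for k in keys]).transform("sum")
```
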
What I've tried
I can solve the problem using sqlalchemy and a query; the results should match the following one:
SELECT a.*, (SELECT COUNT(*)
             FROM df b
             WHERE b.id = a.id
               AND b.id2 = a.id2
               AND b.type = 'C')
FROM df a
The initial problem can also be reworded as: "what's the pandas equivalent of this query?".
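For reference, the way I run that query looks approximately like this (shown here with the standard-library sqlite3 on an in-memory database in place of my real sqlalchemy setup, and on the sample data):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "id":   ["e", "e", "e", "e", "e", "f", "f", "f", "g", "g"],
    "id2":  ["z", "z", "z", "y", "y", "y", "x", "x", "w", "w"],
    "col1": [0, 1, 2, 3, 4, 4, 3, 4, 5, 6],
    "type": ["A", "C", "C", "C", "A", "A", "B", "B", "C", "B"],
})

con = sqlite3.connect(":memory:")
df.to_sql("df", con, index=False)  # building the table is itself a cost

# Correlated subquery: for each row a, count the "C" rows sharing its keys.
result = pd.read_sql_query(
    """
    SELECT a.*, (SELECT COUNT(*)
                 FROM df b
                 WHERE b.id = a.id
                   AND b.id2 = a.id2
                   AND b.type = 'C') AS "count"
    FROM df a
    """,
    con,
)
```
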
I can also solve the problem using apply. Both approaches are very slow given the size of the dataset, although the SQL method is probably slow mainly because it has to build the database first.
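The apply version I used is roughly the following (one full scan of the frame per row, hence the slowness):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   ["e", "e", "e", "e", "e", "f", "f", "f", "g", "g"],
    "id2":  ["z", "z", "z", "y", "y", "y", "x", "x", "w", "w"],
    "col1": [0, 1, 2, 3, 4, 4, 3, 4, 5, 6],
    "type": ["A", "C", "C", "C", "A", "A", "B", "B", "C", "B"],
})

# For every row, scan the whole frame for rows with matching keys and type "C".
df["count"] = df.apply(
    lambda row: ((df["id"] == row["id"])
                 & (df["id2"] == row["id2"])
                 & (df["type"] == "C")).sum(),
    axis=1,
)
```
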
Related posts
This post almost solves the problem, but it doesn't handle extra data columns or multiple index columns, and I couldn't adapt it to my example.
This line is close to what I'm looking for; the only issue is that it only keeps the columns you grouped by:
df.groupby(["id", "id2", "type"]).size().unstack().reset_index()
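For completeness, the closest I got when trying to adapt that idea was to count the "C" rows separately and merge the counts back, something like the sketch below (I'm not sure this is the intended idiom, which is partly why I'm asking):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   ["e", "e", "e", "e", "e", "f", "f", "f", "g", "g"],
    "id2":  ["z", "z", "z", "y", "y", "y", "x", "x", "w", "w"],
    "col1": [0, 1, 2, 3, 4, 4, 3, 4, 5, 6],
    "type": ["A", "C", "C", "C", "A", "A", "B", "B", "C", "B"],
})

# Count "C" rows per (id, id2), then merge the counts back so the other
# columns survive; groups with no "C" row come back as NaN, hence the fillna.
counts = (
    df[df["type"] == "C"]
    .groupby(["id", "id2"])
    .size()
    .rename("count")
)
out = df.merge(counts, left_on=["id", "id2"], right_index=True, how="left")
out["count"] = out["count"].fillna(0).astype(int)
```
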
If any information is missing, please let me know. Thank you for taking the time to read my post, and sorry for any spelling mistakes!