
I have a large CSV file with the following structure:

doc_id, inclusion, id
 1, TRUE, 842
 1, FALSE, 768
 1, FALSE, 292
 1, FALSE, 393
 2, TRUE, 191
 2, TRUE, 389
 2, TRUE, 171
 ...

The id is the primary key. doc_id is a foreign key and identifies the document each id is linked to. Many ids are linked to each document. Each id is classified as included (i.e. inclusion == TRUE) or excluded (i.e. inclusion == FALSE).

I need to create a summary table which displays, for each doc_id, the number of ids included and excluded. I can compute the totals across all doc_ids, but am not sure how to do it per doc_id. For example, for the data above I want:

doc_id included excluded
1      1        3
2      3        0
...

This is my current code:

inc = 0
exc = 0
for index, row in citationData.iterrows():
    if row.inclusion == True:
        inc = inc + 1
    else:
        exc = exc + 1
jdoe

2 Answers


You can use pd.crosstab.

res = pd.crosstab(df['doc_id'], df['inclusion'])

print(res)

inclusion  False  True 
doc_id                 
1              3      1
2              0      3

In general, you should avoid Python-level loops when working with Pandas / NumPy arrays.
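For instance, the per-document counts can also be computed with a vectorized groupby instead of a row-by-row loop. A minimal sketch, assuming the inclusion column holds booleans, with the question's sample data inlined:

```python
import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({
    'doc_id': [1, 1, 1, 1, 2, 2, 2],
    'inclusion': [True, False, False, False, True, True, True],
    'id': [842, 768, 292, 393, 191, 389, 171],
})

# Count included/excluded ids per document without a Python-level loop
res = (df.groupby('doc_id')['inclusion']
         .value_counts()
         .unstack(fill_value=0)
         .rename(columns={True: 'included', False: 'excluded'}))
print(res)
```

`unstack(fill_value=0)` guarantees a 0 for documents where one of the two classes never occurs (e.g. doc_id 2 has no excluded rows).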

jpp
  • Many thanks. Why should I avoid loops with pandas/NumPy arrays? – jdoe Jul 31 '18 at 23:00
  • [This answer](https://stackoverflow.com/a/25778164/9209546) has a nice description (with a comparison to lists). Essentially, more of the work is done at C level rather than Python level. – jpp Jul 31 '18 at 23:02

You can use agg with Counter:

from collections import Counter

s = df.groupby('doc_id').inclusion.agg(Counter)

doc_id
1    {' TRUE': 1, ' FALSE': 3}
2                 {' TRUE': 3}
Name: inclusion, dtype: object

If you need a dataframe, just pass the result to the default DataFrame constructor.

pd.DataFrame(s.tolist(), index=s.index).fillna(0)

         FALSE  TRUE
doc_id
1            3     1
2            0     3
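Note the leading space in the ' TRUE' / ' FALSE' keys above: the sample CSV has a space after each comma, and pandas keeps it by default. One way to strip it at read time is skipinitialspace; a sketch, with an inline csv_text string standing in for the real file:

```python
import io
import pandas as pd

# Inline stand-in for the real CSV file from the question
csv_text = """doc_id, inclusion, id
1, TRUE, 842
1, FALSE, 768
1, FALSE, 292
1, FALSE, 393
2, TRUE, 191
2, TRUE, 389
2, TRUE, 171
"""

# skipinitialspace=True drops the space after each delimiter, so the
# TRUE/FALSE values are recognised and parsed as booleans
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df.dtypes)
```

With the spaces stripped, the inclusion column comes back as dtype bool rather than object, so both answers above work without the stray-space keys.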
rafaelc