
I have a large CSV file with the following structure:

doc_id, inclusion, id
 1, TRUE, 842
 1, FALSE, 768
 1, FALSE, 292
 1, FALSE, 393
 2, TRUE, 191
 2, TRUE, 389
 2, TRUE, 171
 ...

The id is the primary key. doc_id is a foreign key and identifies the document each id is linked to. Many ids are linked to each document. Each id is classified as included (i.e. inclusion == TRUE) or excluded (i.e. inclusion == FALSE).

I need to create a summary table which displays, for each doc_id, the number of ids included and excluded. I can compute the totals across all doc_ids, but am not sure how to do it per doc_id. For example, for the data above I want:

doc_id included excluded
1      1        3
2      3        0
...

This is my current code:

inc = 0
exc = 0
for index, row in citationData.iterrows():
    if row.inclusion == True:
        inc = inc + 1
    else:
        exc = exc + 1
jdoe

2 Answers


You can use pd.crosstab.

res = pd.crosstab(df['doc_id'], df['inclusion'])

print(res)

inclusion  False  True 
doc_id                 
1              3      1
2              0      3

In general, you should avoid Python-level loops when working with Pandas / NumPy arrays.
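For instance, the per-document counts can also be computed with a vectorized groupby instead of a row-by-row loop. A minimal sketch, assuming the inclusion column holds booleans, with the question's sample data inlined:

```python
import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({
    'doc_id': [1, 1, 1, 1, 2, 2, 2],
    'inclusion': [True, False, False, False, True, True, True],
    'id': [842, 768, 292, 393, 191, 389, 171],
})

# Count included/excluded ids per document without a Python-level loop
res = (df.groupby('doc_id')['inclusion']
         .value_counts()
         .unstack(fill_value=0)
         .rename(columns={True: 'included', False: 'excluded'}))
print(res)
```

`unstack(fill_value=0)` guarantees a 0 for documents where one of the two classes never occurs (e.g. doc_id 2 has no excluded rows).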

jpp
  • Many thanks. Why should I avoid loops with pandas/NumPy arrays? – jdoe Jul 31 '18 at 23:00
  • [This answer](https://stackoverflow.com/a/25778164/9209546) has a nice description (with a comparison to lists). Essentially, more of the work is done at C level rather than Python level. – jpp Jul 31 '18 at 23:02

You can use agg with Counter:

from collections import Counter

s = df.groupby('doc_id').inclusion.agg(Counter)

doc_id
1    {' TRUE': 1, ' FALSE': 3}
2                 {' TRUE': 3}
Name: inclusion, dtype: object

If you need a dataframe, just pass the result to the default DataFrame constructor.

pd.DataFrame(s.tolist(), index=s.index).fillna(0)

         FALSE  TRUE
doc_id
1            3     1
2            0     3
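Note the leading space in the ' TRUE' / ' FALSE' keys above: the sample CSV has a space after each comma, and pandas keeps it by default. One way to strip it at read time is skipinitialspace; a sketch, with an inline csv_text string standing in for the real file:

```python
import io
import pandas as pd

# Inline stand-in for the real CSV file from the question
csv_text = """doc_id, inclusion, id
1, TRUE, 842
1, FALSE, 768
1, FALSE, 292
1, FALSE, 393
2, TRUE, 191
2, TRUE, 389
2, TRUE, 171
"""

# skipinitialspace=True drops the space after each delimiter, so the
# TRUE/FALSE values are recognised and parsed as booleans
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df.dtypes)
```

With the spaces stripped, the inclusion column comes back as dtype bool rather than object, so both answers above work without the stray-space keys.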
rafaelc