Efficient counting of word occurrences in Python

Question

Suppose I have these two tables:

Table1:

    ID   CODE         DATE        value1   value2   text
    -----------------------------------------------------
    1    13A       2012-05-04      12.0     0.0     null
    2    13B       2011-06-08      5.5      0.0     null
    3    13C       2012-07-05      4.0      0.0     null
    4    13D       2010-09-09      7.7      0.0     null
    1    13A       .....................................
    1    13D       .....................................
    3    13D       .....................................

Table2:

    CODE  DESCRIPTION
    ------------------
    13A    DISEASE1
    13B    DISEASE2
    13C    DISEASE3
    13D    DISEASE4

I want to find an efficient way to count the code occurrences for each ID and build count vectors based on the codes from the second table. For example:

[2,0,0,1] is the count vector for the person with id=1, where each value is the number of occurrences of the corresponding code from Table2.

I managed to do that in a way, but it does not look very efficient. Is there a more efficient way?

import itertools
import operator
from collections import Counter

sql = "SELECT * FROM table1"
cursor.execute(sql)
table1 = cursor.fetchall()

sql2 = "SELECT CODE FROM table2"
cursor.execute(sql2)
codes = cursor.fetchall()

list1 = []
list2 = []
cnt = Counter()
countList = []
n=len(codes)

for id,iter in itertools.groupby(table1,operator.itemgetter('ID')):
    idList = list(iter)
    list1.append(list((z['CODE']) for z in idList))
for pat in list1:
    for code in codes: 
        cnt=pat.count(code.get('CODE'))
        list2.append(cnt)
countList = [list2[i:i+n] for i in range(0, len(list2), n)]

Something tells me you should write a better SQL query and let the DBMS optimize it for you — inspectorG4dget, Jul 13 '13 at 08:12
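One way to read this suggestion is to let the database do the counting with GROUP BY and only assemble the vectors in Python. The following is a minimal sketch, not the commenter's own code: it uses an in-memory sqlite3 database as a stand-in for the real one, keeps only the ID and CODE columns, and the names position and vectors are illustrative.

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for the real database (assumption); table and
# column names follow the question, reduced to the columns that matter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER, CODE TEXT);
    CREATE TABLE table2 (CODE TEXT, DESCRIPTION TEXT);
    INSERT INTO table1 VALUES (1,'13A'),(2,'13B'),(3,'13C'),(4,'13D'),
                              (1,'13A'),(1,'13D'),(3,'13D');
    INSERT INTO table2 VALUES ('13A','DISEASE1'),('13B','DISEASE2'),
                              ('13C','DISEASE3'),('13D','DISEASE4');
""")

# Fix the order of the codes once; each code gets a slot in the vector.
codes = [row[0] for row in conn.execute("SELECT CODE FROM table2 ORDER BY CODE")]
position = {code: i for i, code in enumerate(codes)}

# The database counts each (ID, CODE) pair; Python only places the counts.
vectors = defaultdict(lambda: [0] * len(codes))
query = "SELECT ID, CODE, COUNT(*) FROM table1 GROUP BY ID, CODE"
for person_id, code, n in conn.execute(query):
    if code in position:  # ignore codes missing from table2
        vectors[person_id][position[code]] = n

for person_id in sorted(vectors):
    print(person_id, vectors[person_id])
```

With the sample rows from the question, id=1 ends up with [2, 0, 0, 1], matching the example above. Only one row per distinct (ID, CODE) pair crosses from the database to Python, which may matter when table1 is large.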

Blender · Accepted Answer · 2013-07-13T04:14:34.013


Using generators will probably speed it up:

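import itertools
import operator

def code_counter(table, codes):
    # groupby only merges adjacent rows, so table must already be sorted by ID.
    for key, group in itertools.groupby(table, key=operator.itemgetter('ID')):
        group_codes = [item['CODE'] for item in group]

        # One count per code, in the same order as codes.
        yield [group_codes.count(code) for code in codes]

if __name__ == '__main__':
    # cursor is assumed to be an open cursor that returns rows as dicts.
    cursor.execute("SELECT * FROM table1 ORDER BY ID")
    table1 = cursor.fetchall()

    cursor.execute("SELECT CODE FROM table2")
    # Use the same key case as the column name ('CODE'), as in the question.
    codes = [code['CODE'] for code in cursor.fetchall()]

    for chunk in code_counter(table1, codes):
        print(chunk)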
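A variation on the same idea, offered as a sketch rather than as part of the original answer: tallying with collections.Counter visits each row once and does not need the rows to be sorted by ID, whereas list.count rescans the group once per code. The function name count_vectors and the sample rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def count_vectors(rows, codes):
    """Map each ID to its list of per-code counts, ordered like codes."""
    per_id = defaultdict(Counter)
    for row in rows:
        # A single pass over the rows; the order of the rows does not matter.
        per_id[row['ID']][row['CODE']] += 1
    # A Counter returns 0 for codes that a given ID never had.
    return {key: [counts[code] for code in codes]
            for key, counts in per_id.items()}

# Sample data shaped like dict-style cursor rows (hypothetical stand-in).
rows = [
    {'ID': 1, 'CODE': '13A'}, {'ID': 2, 'CODE': '13B'},
    {'ID': 3, 'CODE': '13C'}, {'ID': 4, 'CODE': '13D'},
    {'ID': 1, 'CODE': '13A'}, {'ID': 1, 'CODE': '13D'},
    {'ID': 3, 'CODE': '13D'},
]
codes = ['13A', '13B', '13C', '13D']

vectors = count_vectors(rows, codes)
for key in sorted(vectors):
    print(key, vectors[key])
```

Unlike the generator, this holds one Counter per ID in memory, so it trades streaming output for independence from row order.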

You might want to iterate over table1 in chunks.
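Iterating in chunks could look roughly like the sketch below, which pulls rows with the DB-API fetchmany call instead of loading everything with fetchall. It uses an in-memory sqlite3 database so that it runs on its own; the helper name iter_rows and the chunk size are assumptions, and the real driver and connection would take the place of sqlite3.

```python
import sqlite3

def iter_rows(cursor, chunk_size=1000):
    """Yield rows from an executed cursor, fetching chunk_size rows at a time."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield from chunk

# In-memory stand-in for the real database (assumption).
data = [(1, '13A'), (2, '13B'), (3, '13C'), (4, '13D'),
        (1, '13A'), (1, '13D'), (3, '13D')]
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be indexed by column name
conn.execute("CREATE TABLE table1 (ID INTEGER, CODE TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", data)

# ORDER BY ID keeps each ID's rows adjacent, which itertools.groupby needs.
cursor = conn.execute("SELECT ID, CODE FROM table1 ORDER BY ID")
rows = [(row['ID'], row['CODE']) for row in iter_rows(cursor, chunk_size=3)]
print(rows)
```

Because iter_rows is a generator, it can be passed where a fully fetched table would otherwise go, so only one chunk of rows is held in memory at a time.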

edited Jul 13 '13 at 04:14

answered Jul 13 '13 at 00:41

Blender


Hey, thanks! For some reason I only get the occurrences of the first code for each id, though; the rest are zeros. Do you know why? – user2578185 Jul 13 '13 at 02:36
@user2578185: See my update. Are you sure this is even a bottleneck? – Blender Jul 13 '13 at 04:15
