Easy way to process a large data file in Python

Question

I have to compute statistics from a large file. The file has around 100000 rows and 3 columns. The program below works great with a small test file, but when I run it on the large file it takes ages to display even one result. Any suggestions to speed up the loading and computing for a large data file?

Code (the computation is correct with a small test file; the input format is given below):

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    numline = 0
    for line in f:
        numline += 1
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, pairper[pair], pairtime[pair]))

Inputfile:

5372 2684 460.0
1885 1158 351.0
1349 1174 6375.0
1980 1174 650.0
1980 1349 650.0
4821 2684 469.0
4821 937  459.0
2684 937  318.0
1980 606  390.0
1349 606  750.0
1174 606  750.0

Instead of building a dict and then iterating over it to write results to file, just write them to file on the spot. — Tymoteusz Paul, Aug 25 '14 at 16:07
Why? You already have everything you need in your code, simply move the o.write into appropriate place and change the way string is formatted. — Tymoteusz Paul, Aug 25 '14 at 16:10
I fail to understand the logic behind `pairper`, do you overwrite it on each iteration of the first for loop? Also, I think there is some error, since you are using the variable `c` in that line, but it is actually defined later. Finally, correctly indenting the code would help too. — jdehesa, Aug 25 '14 at 16:31
@Puciek - are you referring to paircount? The dict is used to count occurrences of pair, you kinda have to wait until the counting's done before you write it. — tdelaney, Aug 25 '14 at 16:36
_"it takes ages to display even one result"_ 100000 rows don't seem that much by today's standards. Out of curiosity, how long does that take? On what kind of hardware? — Sylvain Leroux, Aug 25 '14 at 16:39
@tdelaney from the very brief glance I am referring to pairper as pair is already finished then and those are the two used in o.write(). Of course I may be completely wrong on that part, but it's not really relevant as the data is there, it's matter of just restructuring the loop so the output will be a write to file, not append to a dict. — Tymoteusz Paul, Aug 25 '14 at 16:40
I'm with @javidcf - you are recreating the pairper dict for each line and only the last one is used. That's 99999 dicts that are created just to be discarded. And as paircount grows, the cost of building the dict goes up. I'm not sure what it's supposed to do either. — tdelaney, Aug 25 '14 at 17:02
@SylvainLeroux Yes, it seems strange to me also. I am using a supercomputer in the lab, not even a laptop. But again, with few rows (around 20 lines) the output is generated and even written to the file. The code does exactly what it is supposed to do. It generates correct results with a small input file, but with the large file I waited about 20 minutes and NO output was generated. — , Aug 25 '14 at 17:23
I know we used to brag about files with 100,000 rows and 3 columns back in the late 70s or something. In any case, what are you actually trying to do? Explain in words please. — Sinan Ünür, Aug 25 '14 at 17:27
@SitzBlogz I didn't read your code closely enough when I posted. I didn't notice you have a "dictionary comprehension" inside your for loop. This is _O(n.log n)_. Or isn't it? — Sylvain Leroux, Aug 25 '14 at 17:28
Both the answers below work great. Thanks to Everyone. As i have to choose only one answer. I would prefer one with less code. — , Aug 26 '14 at 08:35

martineau · Answer 1 · 2014-08-27T12:00:57.233

The primary cause of the slowness is that you recreate the `pairper` dictionary for each line from the `paircount` dictionary, which grows larger and larger. That isn't necessary, because only the value computed after all the lines are processed is ever used.
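To make the cost concrete, here is a minimal sketch (written for Python 3, which is an assumption; the code in this thread is Python 2) that counts how many dictionary entries get built when the percentage dictionary is rebuilt after every line, compared with building it once at the end. The function name `count_rebuild_work` and the synthetic rows are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def count_rebuild_work(rows):
    """Return (entries built when rebuilding per line, entries built once at the end).

    `rows` is an iterable of (first, second, time) tuples.
    """
    paircount = defaultdict(int)
    per_line_work = 0
    for first, second, _time in rows:
        paircount[(first, second)] += 1
        # Rebuilding the percentage dict here touches every pair seen so far.
        per_line_work += len(paircount)
    return per_line_work, len(paircount)


# Synthetic input: 1000 lines, each with a distinct pair (worst case).
rows = [(str(i), str(i + 1), 1.0) for i in range(1000)]
per_line, once = count_rebuild_work(rows)
print(per_line, once)
```

With all pairs distinct, the per-line total grows roughly with the square of the number of lines, which would explain why a 20-line file finishes instantly while a 100000-line file appears to hang.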

I don't fully understand what all the computations are, but here's something equivalent that should run much faster because it only creates the `pairper` dictionary once. I also simplified the logic a bit; that probably didn't affect the run time much either way, but I think it's easier to understand.

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('easy_input.txt', 'r') as f, open('easy_output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    pairper = dict((pair, c * 100.0 / numline) for (pair, c)
                                                in paircount.iteritems())
    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c,
                                          pairper[pair], pairtime[pair]))
print 'done'

Your code works perfectly, but it looks a little complicated for someone like me who is very new to Python. — , Aug 27 '14 at 08:46

score 1 · Accepted Answer · answered Aug 25 '14 at 19:07

The `pairper` calculation is killing you and is not needed. You can use `enumerate` to count the input lines and just use that value at the end. This is similar to martineau's answer except that it doesn't pull the entire input list into memory (bad idea) or even calculate `pairper` at all.

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    for numline, line in enumerate(f, 1):
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))

@tdelaney Can you please suggest how, in the same code, after finding the sum of time, I can also get the average of time and the frequency? — , Sep 05 '14 at 15:14
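One possible reading of this follow-up, sketched under assumptions: "average of time" is taken to mean a pair's total time divided by its number of occurrences, and "frequency" is taken to be the percentage of lines already computed in the answer. The sketch is written for Python 3 (so `.items()` replaces `.iteritems()`), and the function name `summarize_pairs` and the sample lines are hypothetical.

```python
from collections import defaultdict


def summarize_pairs(lines):
    """Aggregate 'first second time' lines into per-pair statistics.

    Returns {pair: (count, percent_of_lines, total_time, average_time)}.
    """
    paircount = defaultdict(int)
    pairtime = defaultdict(float)
    numline = 0  # stays 0 if there is no input at all

    # Single pass: count occurrences and sum the time for each pair.
    for numline, line in enumerate(lines, 1):
        fields = line.split()
        pair = fields[0], fields[1]
        paircount[pair] += 1
        pairtime[pair] += float(fields[2])

    # Derived values are computed once, after all lines are read.
    summary = {}
    for pair, c in paircount.items():  # .iteritems() in Python 2
        total = pairtime[pair]
        summary[pair] = (c, c * 100.0 / numline, total, total / c)
    return summary


# Hypothetical sample in the same format as the input file.
sample = ["1980 1174 10.0", "1980 1174 30.0", "1349 606 5.0", "1980 1174 20.0"]
for pair, stats in summarize_pairs(sample).items():
    print(pair, stats)
```

Because a file object is itself an iterable of lines, `summarize_pairs(f)` should work directly inside a `with open(...) as f:` block, and the average adds only one division per pair in the final loop.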
