
I am trying to count every character in a file and put the counts in a dictionary, but it doesn't quite work: I don't get all of the characters.

#!/usr/bin/env python
import os,sys

def count_chars(p):
     indx = {}
     file = open(p)

     current = 0
     for ch in file.readlines():
          c = ch[current:current+1]
          if c in indx:
               indx[c] = indx[c]+1
          else:
               indx[c] = 1           
          current+=1
     print indx

if len(sys.argv) > 1:
     for e in sys.argv[1:]:
          print e, "contains:"
          count_chars(e)
else:
     print "[#] Usage: ./aufg2.py <filename>"
Luzius L

4 Answers


Assuming the file you're counting fits reasonably in memory:

import collections
with open(p) as f:
    indx = collections.Counter(f.read())

Otherwise, you can read it bit by bit:

import collections
with open(p) as f:
    indx = collections.Counter()
    buffer = f.read(1024)
    while buffer:
        indx.update(buffer)
        buffer = f.read(1024)
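
As a usage sketch (not part of the original answer), assuming the `Counter` was built as above from a hypothetical file path, the counts can then be queried directly:

import collections

p = 'example.txt'  # hypothetical file path, for illustration only
with open(p) as f:
    indx = collections.Counter(f.read())

print indx.most_common(5)   # the five most frequent characters and their counts
print indx['e']             # looking up a character that never occurs simply returns 0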
Amber
  • Or make it a bit more explicit and let the buffering occur naturally - something like `Counter(iter(lambda: f.read(1), ''))` – Jon Clements Jan 05 '13 at 21:26
  • @JonClements Isn't reading small amount of bytes much slower than reading from file in chunks? – ovgolovin Jan 05 '13 at 21:35
  • @ovgolovin No... for a buffered input stream, you just read into the buffer, then it overflows naturally... – Jon Clements Jan 05 '13 at 21:37
  • And a hare-brained idea would be to use [`mmap`](http://docs.python.org/2/library/mmap.html), which would do all the buffering. But I haven't given it any research, so can't form it as a separate answer. – ovgolovin Jan 05 '13 at 21:37
  • @ovgolovin It'd work, but in this use-case - not the best of options – Jon Clements Jan 05 '13 at 21:48
  • @JonClements Your proposed solution still results in far more function calls (to `.read(1)`) which can slow things down (simply due to the overhead of setting up the function stack for each call, et cetera). Choosing a reasonable amount to read all at once reduces the function call overhead while still letting the built-in buffering play in if Python chooses to use a larger buffer. – Amber Jan 06 '13 at 00:09
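
To make the overhead point from these comments concrete, here is a rough timing sketch (the file name 'example.txt' and the chunk size are illustrative; actual figures depend heavily on the Python version and on OS buffering):

import timeit

setup = "from collections import Counter"

# One f.read() call per byte, via the two-argument iter form.
per_byte = """
with open('example.txt') as f:
    counts = Counter(iter(lambda: f.read(1), ''))
"""

# One f.read() call per 1024-byte chunk, updating the Counter in bulk.
chunked = """
with open('example.txt') as f:
    counts = Counter()
    buf = f.read(1024)
    while buf:
        counts.update(buf)
        buf = f.read(1024)
"""

print timeit.timeit(per_byte, setup, number=10)
print timeit.timeit(chunked, setup, number=10)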

The main problem is that you only examine (at most!) one character from every line. If you're reading the file line by line, you need an inner loop that iterates over the line's characters.

#!/usr/bin/env python
import os, sys, collections

def count_chars(p):
     indx = collections.Counter()
     with open(p) as f:
         for line in f:
             for c in line:
                 indx[c] += 1
     print indx

if len(sys.argv) > 1:
     for e in sys.argv[1:]:
          print e, "contains:"
          count_chars(e)
else:
     print "[#] Usage: ./aufg2.py <filename>"
NPE

Use a defaultdict. Basically, if you try to get a nonexistent item from a defaultdict, it creates the key and calls the factory passed as the constructor's first argument (here `int`, which returns 0) to produce the value.

import collections

def count_chars(p):
    d = collections.defaultdict(int)
    for letter in open(p).read():
        d[letter] += 1
    return d
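
A quick demonstration (not from the answer itself) of the defaultdict behaviour described above, where looking up a missing key calls the factory, here int(), to supply the default:

import collections

d = collections.defaultdict(int)
d['a'] += 1        # 'a' is missing, so int() supplies 0, then 1 is added
print d['a']       # 1
print d['b']       # 0 -- merely reading a missing key inserts it with the default
print dict(d)      # {'a': 1, 'b': 0} (key order may vary)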
riamse
    `defaultdict` is the wrong data structure for this, given that `Counter` exists in the very same module. – Amber Jan 05 '13 at 21:17

I've posted this as a comment to @Amber's answer, but will reiterate here...

To count the occurrences of bytes in a file, generate a small iterator:

from collections import Counter

with open('file') as fin:
    chars = iter(lambda: fin.read(1), '')
    counts = Counter(chars)

This way the underlying buffering from `fin` still applies, but it stays explicit that you're reading one byte at a time (rather than a block size, which the OS will buffer on its own regardless). It also avoids calling `update` on the `Counter` object, and in effect becomes a single, complete, stand-alone instruction.
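
For reference, the two-argument form of `iter` used here calls the supplied callable repeatedly and stops as soon as it returns the sentinel (the empty string above). A minimal sketch with an in-memory file, just to illustrate the mechanism:

from StringIO import StringIO
from collections import Counter

fin = StringIO('abca')
chars = iter(lambda: fin.read(1), '')   # yields 'a', 'b', 'c', 'a', then hits '' and stops
print Counter(chars)                    # 'a' counted twice, 'b' and 'c' once each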

Jon Clements
  • +1! But there might be a problem, as reading 1 byte at a time may be pretty slow. I faced this problem answering this question: http://stackoverflow.com/a/8284246/862380 – ovgolovin Jan 05 '13 at 21:55
  • @ovgolovin interesting - it's possible a `partial(fin.read, 1)` instead of a lambda might be better as to function call overhead, but apart from that, I see no reason why, with buffering, it would be any slower than reading chunks (although I don't dispute it) - one could also specify the buffer size on the `open`... – Jon Clements Jan 05 '13 at 21:59
  • I'm not sure about `partial`, as it probably still invokes `fin.read` for each byte read. And I commented just because I don't fully understand these issues and I hoped I could learn something new! That time I was really confused by a 60× ratio between reading in chunks and reading 1 byte at a time. And one more thing: benchmarks like this in Python are shaky, as there are changes from version to version of Python that move the figures dramatically. – ovgolovin Jan 05 '13 at 22:04
  • @ovgolovin I have to get up early, so I'm afraid it's time I retired for the evening, but yeah, the first release of Py3 was horrendous in benchmarks regarding file IO, and bugs have been fixed in buffering and IO over the years, and partial'ing a function is slightly different than lambda'ing... there's plenty of discussion out there... Just that it'd be a question in itself! – Jon Clements Jan 05 '13 at 22:08
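
For completeness, the `partial` variant floated in these comments would look roughly like this (a sketch only; whether it actually beats the lambda on function-call overhead is exactly the open question discussed above):

from functools import partial
from collections import Counter

# 'example.txt' is a placeholder file name.
with open('example.txt') as fin:
    counts = Counter(iter(partial(fin.read, 1), ''))

print counts.most_common(5)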