
I want to read some rather huge files (to be precise: the Google ngram 1-gram dataset) and count how many times each character occurs. I wrote this script:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)

which works fine until it reaches approximately line 700,000 of the first file, where I get this error:

../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>

To solve this I searched the web a bit and came up with this code:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files, False, '', 0, 'r', fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)

but the hook I now use tries to read the entire 990 MB file into memory at once, which pretty much crashes my PC. Does anyone know how to rewrite this code so that it actually works?

P.S.: The code hasn't even run all the way through yet, so I don't know if it does what it's supposed to do, but before that can happen I first need to fix this bug.

Oh, and I use Python 3.2

teuneboon
  • performance comparison of counting chars in Python, Cython, C, .. http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2525617#2525617 – jfs Mar 30 '11 at 23:52
  • For the interested folks, the result (I added a filter for a-z only and lowercased everything): {'a': 102037493781, 'c': 42883014812, 'b': 19831999435, 'e': 160625131890, 'd': 49858005683, 'g': 23703400644, 'f': 32139997560, 'i': 97477105220, 'h': 63989934675, 'k': 7050807601, 'j': 2260108213, 'm': 32292575753, 'l': 52782661506, 'o': 100366604971, 'n': 93886203967, 'q': 1622282068, 'p': 27264105140, 's': 85883631327, 'r': 80049264186, 'u': 35187497669, 't': 114609472329, 'w': 21891971718, 'v': 13296202464, 'y': 21467638892, 'x': 3007834707, 'z': 1333102460} – teuneboon Mar 31 '11 at 00:58

6 Answers


I do not know why fileinput does not work as expected.

I suggest you use the open function instead. The return value can be iterated over and will return lines, just like fileinput.

The code will then be something like:

for filename in files:
    print(filename)
    with open(filename, encoding="utf-8") as f:
        for filelineno, line in enumerate(f):
            line = line.strip()
            data = line.split('\t')
            # ...

Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).
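
For reference, a fuller sketch that folds the question's counting loop into this pattern (the file list, tab-separated format, and progress printing are all taken from the question):

charcounts = {}
for filename in files:
    print(filename)
    with open(filename, encoding="utf-8") as f:
        for filelineno, line in enumerate(f, start=1):
            data = line.strip().split('\t')
            for character in data[0]:
                # accumulate the ngram's occurrence count per character
                charcounts[character] = charcounts.get(character, 0) + int(data[1])
            if filelineno % 100000 == 0:
                print(filelineno)
print(charcounts)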

codeape
  • That worked, thanks. Might it be a bug in Python that it doesn't work with fileinput? – teuneboon Mar 30 '11 at 22:28
  • Yes, it very well could be. It seems a bit strange. It could also be that one of the options you pass to fileinput makes it behave in this way. I don't know enough about fileinput to know. – codeape Mar 30 '11 at 22:32
  • @teuneboon: It appears that your file is proper ASCII. This is a default assumption in a lot of Python packages. – S.Lott Mar 31 '11 at 01:18
  • 1
    It's a bug in the fileinput/codecs packages. Fileinput calls stream.readlines - but gives it a buffer size to fill - codecs.StreamReader.readlines explicitly ignores the buffer size and reads the whole file. – gromgull Feb 03 '14 at 15:20
  • Is there a bug filed somewhere indicating whether this is being worked on? – Fred Jul 30 '14 at 22:37

The problem is that fileinput doesn't read line by line the way file.xreadlines() does; instead it uses file.readlines(bufsize), which reads bufsize bytes at once (and turns that into a list of lines). You are providing 0 for the bufsize parameter of fileinput.input() (which is also the default value). A bufsize of 0 means the whole file is buffered.

Solution: provide a reasonable bufsize.
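
A sketch of that suggestion, assuming the Python 3.2 signature in which fileinput.input() still accepts a bufsize argument (it was later deprecated and removed in Python 3.8). Note that, per the comment on the accepted answer, the encoded hook's readlines() may ignore the hint, so this is an attempt rather than a guarantee:

import fileinput

# bufsize is only a hint to the underlying readlines() call
for line in fileinput.input(files, bufsize=2 ** 16,
                            openhook=fileinput.hook_encoded('utf-8')):
    data = line.strip().split('\t')
    # ...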

Steven

This works for me: you can use "utf-8" in the hook definition. I used it on a 50 GB file with 200M lines with no problem.

fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))
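
The FileInput instance is then iterated like any other file object. A sketch, reusing the files list and tab-separated layout from the question (whether memory stays bounded depends on the fileinput/codecs interaction discussed in the comments above):

fi = fileinput.FileInput(files, openhook=fileinput.hook_encoded("iso-8859-1"))
for line in fi:
    data = line.strip().split('\t')
    # ...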
oba

Could you try reading not the whole file, but a part of it as binary, then decode(), process it, and then call the function again to read another part?
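
A minimal sketch of that approach, assuming UTF-8 and a hypothetical per-line processing step; the incremental decoder from the codecs module handles multi-byte sequences that straddle chunk boundaries:

import codecs

def process_in_chunks(path, chunk_size=2 ** 20):
    decoder = codecs.getincrementaldecoder('utf-8')()
    pending = ''  # partial line carried over from the previous chunk
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            text = decoder.decode(chunk, final=not chunk)
            lines = (pending + text).split('\n')
            pending = lines.pop()  # the last piece may be an incomplete line
            for line in lines:
                pass  # process each complete, decoded line here
            if not chunk:
                break
    # pending now holds any final line without a trailing newline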

fogbit

I don't know if the version I have is the latest (and I don't remember how I read the files), but...

$ file -i googlebooks-eng-1M-1gram-20090715-0.csv 
googlebooks-eng-1M-1gram-20090715-0.csv: text/plain; charset=us-ascii

Have you tried fileinput.hook_encoded('ascii') or fileinput.hook_encoded('latin_1')? Not sure why this would make a difference, since I think these are just subsets of Unicode with the same mapping, but it's worth a try.
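
For reference, those attempts would look like this (per the edit below, neither resolved the problem):

for line in fileinput.input(files, openhook=fileinput.hook_encoded('ascii')):
    ...  # or fileinput.hook_encoded('latin_1')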

EDIT: I think this might be a bug in fileinput; neither of these works.

dfb

If you are worried about memory usage, why not read line by line using readline()? This will get rid of the memory issues you are running into. Currently you are reading the full file before performing any actions on the file object; with readline() you are not saving the data, merely scanning it on a per-line basis.

def charCount1(_file, _char):
    # Reads the whole file into memory at once before scanning it.
    result = []
    with open(_file, encoding="utf-8") as f:
        data = f.read()
    for index, line in enumerate(data.split("\n")):
        if _char in line:
            result.append(index)
    return result

def charCount2(_file, _char):
    # Reads one line at a time with readline(), keeping memory use flat.
    result = []
    count = 0
    with open(_file, encoding="utf-8") as f:
        while True:
            line = f.readline()
            if not line:
                break
            if _char in line:
                result.append(count)
            count += 1
    return result

I didn't have a chance to really look over your code, but the above samples should give you an idea of how to make the appropriate changes to your structure. charCount1() demonstrates your method, which caches the entire file in a single call to read(). I tested it on a 400+ MB text file and the python.exe process went as high as 900+ MB. When you run charCount2(), the python.exe process shouldn't exceed more than a few MB (provided you haven't bulked up the size with other code) ;)
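
Hypothetical usage, with a filename following the question's pattern:

# Hypothetical call: collect the line numbers containing 'a' in the first file.
matches = charCount2('../../datasets/googlebooks-eng-all-1gram-20090715-0.csv', 'a')
print(len(matches))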

AWainb