I want to read a large file (1.4 GB) using Python 2.7.6 (32-bit), but the approach I tried doesn't work. For instance, if I run the following code:

import os

os.system('dir')  # gives me 20/03/2014  16:55     1 414 488 081 book.review
with open('book.review') as f:
    tmp = 0  # running count of characters read
    for line in f:
        print line
        for c in line:
            tmp += 1

The final value of tmp is 6642, far fewer than the 1 414 488 081 bytes reported by dir.

On top of that, the last printed line is

Like Calhoun the support for states rights remains a complete fa

while the corresponding line is

Like Calhoun the support for states rights remains a complete fa^Zade. I found Russell Kirk's salesmanship of Conservativism generally repellent but recommend the book because it remains a fairly enlightening view of an ideology that continues to thrive to this day.

Moreover, this is only the 93rd line of a 39001831-line file.

I could really use a fresh pair of eyes on this problem; I don't understand what is happening.

Update

The trouble obviously comes from this ^Z (which I had not spotted when I asked the question). However, I can't manage to get rid of it: things like line.replace('^Z', '') aren't enough.
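For context on why the replace has no effect: the ^Z that vi displays is a single control byte, 0x1A (Ctrl-Z), not the two characters `^` and `Z`, so line.replace('^Z', '') has nothing to match. On Windows, Python 2 reads text-mode files through the C runtime, which treats that byte as an end-of-file marker, and that would explain why iteration stops in the middle of line 93. Below is a minimal sketch, assuming the file can simply be opened in binary mode ('rb'). The helper names count_bytes and strip_ctrl_z are hypothetical, and a small temporary file stands in for book.review.

```python
import os
import tempfile

CTRL_Z = b'\x1a'  # the single byte that vi renders as ^Z


def count_bytes(path):
    """Count bytes line by line, reading in binary mode.

    Binary mode ('rb') bypasses the Windows text-mode handling that
    treats 0x1A as end-of-file, so iteration continues past it.
    """
    total = 0
    with open(path, 'rb') as f:
        for line in f:
            total += len(line)
    return total


def strip_ctrl_z(line):
    """Remove every 0x1A byte from a line read in binary mode."""
    return line.replace(CTRL_Z, b'')


# Small demonstration on a temporary file (a stand-in for book.review).
sample = b'first line\nfa\x1aade\nlast line\n'
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    out.write(sample)

demo_total = count_bytes(demo_path)      # every byte is counted, 0x1A included
demo_clean = strip_ctrl_z(b'fa\x1aade')  # the control byte is dropped
os.remove(demo_path)
```

Note that the replacement targets b'\x1a' rather than the literal text '^Z', which is only how editors choose to display the byte.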

merours
  • At a guess the first character that's missing is: ç. I'm guessing your issue is something to do with character encoding. – Phylogenesis Mar 20 '14 at 17:00
  • It is not (from what I can see, at least). The final printed word is `fa` while it should be `faade`. – merours Mar 20 '14 at 17:01
  • Reading the file with vi, it appears that a discreet `^Z` sits between the two aforementioned 'a's. Would you know a way to get rid of it? – merours Mar 20 '14 at 17:06
  • This might help to remove the EOF characters from the file - http://stackoverflow.com/questions/20695336/how-to-process-huge-text-files-that-contain-eof-ctrl-z-characters-using-python – isubuz Mar 20 '14 at 17:42
  • It does indeed, thanks. – merours Mar 20 '14 at 17:47
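One possible way to remove the EOF characters discussed in the comments, offered as a sketch rather than as the method from the linked question: copy the file in binary mode in fixed-size chunks and drop every 0x1A byte along the way. Reading in chunks keeps memory use small, which matters for a 1.4 GB file on a 32-bit interpreter. The function name remove_ctrl_z, the 1 MB default chunk size, and the temporary file names are all assumptions made for illustration.

```python
import os
import tempfile


def remove_ctrl_z(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, dropping every 0x1A byte.

    Returns the number of bytes removed. The file is read in
    fixed-size chunks so it never has to fit in memory at once.
    """
    removed = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty result means end of file
                break
            cleaned = chunk.replace(b'\x1a', b'')
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed


# Demonstration on small temporary files (stand-ins for book.review).
tmp_dir = tempfile.mkdtemp()
dirty_path = os.path.join(tmp_dir, 'dirty.txt')
clean_path = os.path.join(tmp_dir, 'clean.txt')
with open(dirty_path, 'wb') as f:
    f.write(b'fa\x1aade\nnext line\n')

# A tiny chunk size forces several read/write rounds.
removed_count = remove_ctrl_z(dirty_path, clean_path, chunk_size=4)
with open(clean_path, 'rb') as f:
    cleaned_bytes = f.read()

os.remove(dirty_path)
os.remove(clean_path)
os.rmdir(tmp_dir)
```

Because a single byte is being removed, a chunk boundary can never split the pattern, so no overlap handling between chunks is needed. The cleaned copy can then be read with the original text-mode loop.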
