I am reading lines from a text file. Each line has two fields separated by the special marker <xx>.

The code:

from datetime import datetime

now = datetime.now()
result = open("myresult" + now.strftime("%d%m%Y-%H%M%S") + ".txt", "a")
inFile = open("test.txt", "r")

x = 1
for i in inFile:
    print("line", str(x))
    print(i)
    print(i.split("<xx>", 1)[1])
    x = x + 1

When Python reads the large file, the last line it parses is line 2060; after that it shows this error:

Traceback (most recent call last):
  File "mycode.py", line 11, in <module>
    for i in inFile:
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 3901: invalid continuation byte

When I extracted line 2061 from the input file test.txt I found this string:

https://rrr.com<xx>{'Server': 'nginx', 'Date': 'Fri, 19 Apr 2019 06:01:30 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'btst=1c95019e21634b953d79e9124ec8a40a|127.0.0.1|1555653690|1555653690|0|1|0; path=/; domain=.rrr.com; Expires=Thu, 15 Apr 2027 00:00:00 GMT; HttpOnly; SameSite=Lax;, snkz=127.0.0.1; path=/; Expires=Thu, 15 Apr 2027 00:00:00 GMT', 'Content-Encoding': 'gzip'}

When I tried to put it in a separate file and parse it alone, I did not get an error.

Can anyone explain what the problem is? How can I solve the issue so that Python does not stop at this line?

EDIT:

Please note that I have records from various sources, i.e., they do not follow a specific encoding that I know of. Is there anything universal that can solve the issue?

EDIT: Based on one comment, I tried the following. The cursor hangs after `...` until I press Enter.

Python 3.6.5 (default, Mar 15 2019, 05:40:52) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("test.txt","rb") as f: print(repr(f.read()[3890:3910]))
... 
b"sfer-Encoding': 'chu"
None
  • @snakecharmerb After changing the encoding, it worked but stopped at a later line which contains something like: `'جرÙx8aدة اÙx84Ùx85Ùx88جز` – None May 11 '19 at 16:04
  • If you do not know the encoding, why are you reading the files as text to begin with? The content cannot make sense to you. Read them as bytes if you just care about the data. – MisterMiyagi May 11 '19 at 16:08
  • Can I read them as ASCII and ignore any errors? – None May 11 '19 at 16:08
  • This is data from various sources for further analysis. I just need to read them now so I can insert them into a DB later. – None May 11 '19 at 16:09

1 Answer

The problem is that, in spite of what you believe, the file was created with a specific encoding. To decode the file meaningfully you need to use that same encoding. Otherwise you are not guaranteed to get the text as intended by the creator of the file.

If losing some data is OK for you, you can avoid the errors by using errors='ignore':

open("test.txt", "r", errors="ignore")

But, as I said before, you probably won't get the text as originally intended.

For more options on the errors argument, run this code in a Python console:

import codecs
help(codecs.Codec)

But again, none of them will get you the text as intended if the encoding is wrong.
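To make the difference between the error handlers concrete, here is a small sketch. The byte string is made up for illustration (it is not taken from the question's file); it contains the byte 0xe9, which is "é" in Latin-1 but is not valid UTF-8 when followed by a space.

```python
# Hypothetical sample: valid Latin-1, but invalid as UTF-8
# (0xe9 starts a multi-byte sequence and is followed by a space).
raw = b"caf\xe9 <xx> value"

# Decode the same bytes with several standard error handlers.
handlers = ("ignore", "replace", "backslashreplace", "surrogateescape")
results = {name: raw.decode("utf-8", errors=name) for name in handlers}

for name, text in results.items():
    print(name.ljust(16), repr(text))

# With surrogateescape the undecodable byte is kept as a lone surrogate,
# so encoding with the same handler gives back the original bytes.
round_trip = results["surrogateescape"].encode("utf-8", errors="surrogateescape")
print(round_trip == raw)
```

With ignore the bad byte simply disappears, and with replace it becomes U+FFFD; in both cases the original byte cannot be recovered from the decoded string. None of the handlers tells you what character the author actually meant.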

Regarding your question about not losing data: if you don't know the original encoding, you have already lost data. The lines that cannot be read are not the only problematic ones. Even if a line can be read without errors, there is no way to tell whether the characters you read are the same characters that were in the original text, except maybe for lines that contain only ASCII characters.
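If skipping undecodable lines is acceptable, one possible approach is to open the file in binary mode and decode each line yourself, so that a failure affects only that line. The following is a minimal sketch under stated assumptions: the function name `split_records`, the choice of UTF-8 as the expected encoding, and the demo data are all hypothetical and not from the question.

```python
import os
import tempfile


def split_records(path, encoding="utf-8", sep="<xx>"):
    """Return (good, bad): parsed records and the raw lines that were skipped."""
    good, bad = [], []
    with open(path, "rb") as f:  # binary mode: no decoding happens while iterating
        for number, raw_line in enumerate(f, start=1):
            try:
                text = raw_line.decode(encoding)
            except UnicodeDecodeError:
                # Keep the untouched bytes so nothing is altered or lost.
                bad.append((number, raw_line))
                continue
            fields = text.rstrip("\r\n").split(sep, 1)
            if len(fields) == 2:
                good.append((number, fields[0], fields[1]))
            else:
                # The line decoded but does not contain the separator.
                bad.append((number, raw_line))
    return good, bad


# Demo with made-up data: line 2 contains a byte that is not valid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"https://a.example<xx>{'Server': 'nginx'}\n")
    tmp.write(b"https://b.example<xx>caf\xe9\n")
    tmp.write("https://c.example<xx>café\n".encode("utf-8"))
    demo_path = tmp.name

good, bad = split_records(demo_path)
os.remove(demo_path)

print("parsed:", good)
print("skipped:", bad)
```

Note that this only catches lines that fail to decode. A line written in another encoding can still decode without any error and produce the wrong characters, which is the silent case described above.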

Stop harming Monica
  • Thanks. I must not change anything in the data. I prefer to skip the problematic lines altogether. Does try/except seem like the right thing to do? – None May 11 '19 at 16:14
  • But where should I place it? Are these errors raised in `open` or in the `split` function? – None May 11 '19 at 16:15
  • AttributeError: module 'codecs' has no attribute 'Codecs' – None May 11 '19 at 16:24
  • Will try/except help? – None May 11 '19 at 16:32
  • Can you please clarify: will using `errors="ignore"` lose the data that cannot be read, or will it change anything else? Also, please clarify: will try/except solve the problem? – None May 11 '19 at 16:59
  • @None Decoding with the wrong codec will change things. Sometimes it will be unable to read a character and will raise an error (that you can ignore or handle in several ways); sometimes it will happily and silently read the wrong character. – Stop harming Monica May 11 '19 at 17:26