I am reading lines from a text file. Each line has two fields separated by the special delimiter <xx>.
The code:
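from datetime import datetime

now = datetime.now()
# Output file named with the current timestamp
result = open("myresult" + now.strftime("%d%m%Y-%H%M%S") + ".txt", "a")
inFile = open("test.txt", "r")
x = 1
for i in inFile:
    print("line", str(x))
    print(i)
    # Second field: everything after the first <xx> delimiter
    print(i.split("<xx>", 1)[1])
    x = x + 1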
When Python reads the large file, the last line it parses is line 2060. After that it shows this error:
Traceback (most recent call last):
  File "mycode.py", line 11, in <module>
    for i in inFile:
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 3901: invalid continuation byte
When I extracted line 2061 from the input file test.txt, I found this string:
https://rrr.com<xx>{'Server': 'nginx', 'Date': 'Fri, 19 Apr 2019 06:01:30 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'btst=1c95019e21634b953d79e9124ec8a40a|127.0.0.1|1555653690|1555653690|0|1|0; path=/; domain=.rrr.com; Expires=Thu, 15 Apr 2027 00:00:00 GMT; HttpOnly; SameSite=Lax;, snkz=127.0.0.1; path=/; Expires=Thu, 15 Apr 2027 00:00:00 GMT', 'Content-Encoding': 'gzip'}
When I tried to put it in a separate file and parse it alone, I did not get an error.
Can anyone explain what the problem is? How can I solve it so that Python does not stop at this line?
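For illustration, here is a minimal sketch of one way to keep the loop from stopping, assuming it is acceptable to lose the undecodable bytes. It passes `errors="replace"` to `open()`, so any byte sequence that is not valid UTF-8 becomes the replacement character U+FFFD instead of raising. The function name `split_fields` and the demo file contents are made up for this example and are not part of my real code.

```python
import os
import tempfile


def split_fields(path, delimiter="<xx>"):
    """Yield (line_number, first_field, second_field) without stopping on bad bytes."""
    # errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8
    with open(path, "r", encoding="utf-8", errors="replace") as in_file:
        for line_number, line in enumerate(in_file, start=1):
            first, sep, second = line.rstrip("\n").partition(delimiter)
            if not sep:
                continue  # no delimiter on this line, so there is no second field
            yield line_number, first, second


# Demo on a small hypothetical file whose second line holds a stray 0xe2 byte
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"https://a.example<xx>ok\n")
    tmp.write(b"https://b.example<xx>bad \xe2 byte\n")
    demo_path = tmp.name

rows = list(split_fields(demo_path))
os.remove(demo_path)

for row in rows:
    print(row)
```

Using `partition` instead of `split(...)[1]` also avoids an `IndexError` on a line that has no delimiter at all.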
EDIT:
Please note that I have records from various sources, i.e., they do not follow a specific encoding that I know of. Is there anything universal that can solve the issue?
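As far as I understand, nothing is truly universal, because the bytes alone do not say which encoding produced them. A rough sketch of a fallback approach, assuming the records are mostly UTF-8 with some Windows-1252 mixed in (that mix is a guess, not something I know about my data): try each candidate encoding per line, and fall back to latin-1, which maps every byte to a character and therefore never fails. The helper name `decode_line` is hypothetical.

```python
def decode_line(raw, encodings=("utf-8", "cp1252")):
    """Decode one line of bytes, returning (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    # latin-1 accepts all 256 byte values, so this last resort cannot raise
    return raw.decode("latin-1"), "latin-1"


# Hypothetical sample lines, as they would come from a file opened with "rb"
samples = [b"caf\xc3\xa9<xx>utf-8 bytes", b"caf\xe9<xx>cp1252 bytes", b"\x81<xx>neither"]
for raw in samples:
    text, used = decode_line(raw)
    print(used, repr(text.split("<xx>", 1)[1]))
```

The text recovered by a fallback encoding may not be what the source intended, so it can be worth logging which encoding was used for each line.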
EDIT:
Based on one comment, I tried the following. The cursor hangs at the ... prompt until I press Enter.
Python 3.6.5 (default, Mar 15 2019, 05:40:52)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("test.txt","rb") as f: print(repr(f.read()[3890:3910]))
...
b"sfer-Encoding': 'chu"
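The bytes around offset 3901 look harmless here, which is likely because the position in the traceback is relative to the chunk that was handed to the decoder, not to the start of the file. A small sketch of how the real offset could be located, assuming the file fits in memory: decode the whole byte string at once and read `start` from the `UnicodeDecodeError`. The function name `find_invalid_utf8` and the sample bytes are made up for illustration.

```python
def find_invalid_utf8(data):
    """Return (byte_offset, line_number, context) for the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is an absolute offset because the whole input was decoded in one call
        line_number = data.count(b"\n", 0, err.start) + 1
        context = data[max(0, err.start - 10):err.start + 10]
        return err.start, line_number, context
    return None


# Hypothetical sample; for the real file: data = open("test.txt", "rb").read()
sample = b"line one\nline two \xe2 here\n"
found = find_invalid_utf8(sample)
print(found)
```

Printing the `context` bytes with `repr` shows the raw values around the offending byte, which text-mode tools tend to hide.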