Text mining UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1671718: character maps to <undefined>

Question

I have written code to create a frequency table, but it breaks at the line text_string = document_text.read().lower(). I even added a try/except to catch the error, but it does not help.

import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    try:
        count = frequency.get(word,0)
        frequency[word] = count + 1
    except UnicodeDecodeError:
        pass

frequency_list = frequency.keys()

for words in frequency_list:
    print (words, frequency[words])

Miquel Vande Velde · Answer 1 · 2018-07-24T18:38:54.160

You are opening your file twice, the second time without specifying the encoding:

file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')

You should open the file as follows:

frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
    text = f.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
...

The second time you opened the file, you did not specify an encoding, so Python fell back to the platform's default encoding, which is probably why it errored. The with statement also takes care of closing the file when the block ends, even if an error occurs. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python
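To see why the missing encoding matters, here is a minimal sketch that reproduces the same kind of error. It assumes the default encoding on the asker's machine was cp1252 (common on Windows and consistent with the 'charmap' wording, but not confirmed in the question). The sample bytes are made up for illustration.

```python
# Byte 0x81 has no character assigned in Windows-1252 (cp1252),
# so decoding it with that codec fails with a 'charmap' error.
raw = b"caf\xc3\xa9 \x81"  # hypothetical sample bytes, not taken from the real file

try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

print(cp1252_error)
```

The printed message names the offending byte and its offset, just like the traceback in the question.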

You should probably have a look at error handling as well, since your try/except did not enclose the line that was actually causing the error (the read() call): https://www.pythonforbeginners.com/error-handling/
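As a rough sketch of what enclosing the right line could look like, the try/except below wraps the open() and read() calls, which is where decoding happens. The helper name read_text and the temporary sample file are hypothetical and only there to make the example runnable.

```python
import os
import tempfile


def read_text(path, encoding="utf8"):
    """Return the decoded text, or None if the file is not valid in `encoding`."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()  # decoding errors are raised here, not in later loops
    except UnicodeDecodeError as exc:
        print(f"Could not decode {path!r} as {encoding}: {exc}")
        return None


# Hypothetical sample file containing one byte that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"good text \x81 bad byte")
    sample_path = tmp.name

result = read_text(sample_path)
os.remove(sample_path)
```

Returning None is just one possible policy. Depending on the task you might prefer to re-raise, or to fall back to another encoding.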

The code ignoring all decoding issues:

import re
import string  # Unused here; remove it unless other code needs it.

with open('EVG_text mining.txt', mode='rb') as f:  # The 'b' in mode makes open() return raw bytes.
    raw = f.read()  # Named 'raw' so the built-in 'bytes' type is not shadowed.

# 'ignore' drops undecodable bytes; 'replace' would insert U+FFFD (the replacement character) instead.
text = raw.decode('utf-8', 'ignore').lower()  # lower() so the [a-z] pattern also matches capitalised words.

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)

frequencies = {}
for word in match_pattern:  # Your error handling wasn't doing anything here as the error didn't occur here but when reading the file.
    frequencies[word] = frequencies.get(word, 0) + 1

for word, freq in frequencies.items():
    print(word, freq)
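To make the difference between the two error handlers concrete, here is a small illustration on made-up bytes (not from the actual file): valid UTF-8 text with one stray 0x81 byte in the middle.

```python
raw = b"na\xc3\xafve \x81 text"  # hypothetical bytes: valid UTF-8 plus one stray 0x81

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is silently dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

With 'replace' the damaged spots stay visible in the output, which can help when you later want to check how much of the file was affected.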

Hi @miquel, please see my edit. I implemented the change you suggested, but I am getting a new error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 71008: invalid continuation byte — , Jul 24 '18 at 16:38
Are you sure the file is encoded in utf-8? Look here for info on encoding/decoding bytes: https://docs.python.org/3/howto/unicode.html. — Miquel Vande Velde, Jul 24 '18 at 18:27
You can do two things. Check the file and try to find the right encoding. Or open the file as bytes and decode ignoring/replacing all bytes it doesn't recognise. I've edited your code to ignore all unknown bytes and refactored the rest of the code so it should work. Although the best way would still be to find out what characters it isn't decoding. — Miquel Vande Velde, Jul 24 '18 at 18:30
Thank you, Miquel. The edit solved the problem. Also, do you know of any Python library other than WordCloud that could help visualize word frequencies? — , Jul 24 '18 at 19:43
@Joey No worries! Sorry, no, I am not very familiar with visualisation tools in Python. — Miquel Vande Velde, Jul 24 '18 at 20:34
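For the suggestion in the comments to find out which characters are not decoding, one possible approach is sketched below. It relies on the start and end attributes of UnicodeDecodeError. The function name find_undecodable and the sample bytes are hypothetical, and re-slicing the data after each failure is simple rather than fast, so treat it as a diagnostic aid only.

```python
def find_undecodable(raw, encoding="utf-8", context=10):
    """Return (offset, bad_bytes, surrounding_bytes) for each span that fails to decode."""
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(encoding)
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice, so shift them by pos.
            start, end = pos + exc.start, pos + exc.end
            surrounding = raw[max(0, start - context):end + context]
            problems.append((start, raw[start:end], surrounding))
            pos = end  # continue scanning after the bad span
    return problems


sample = b"abc\x81def\xc2 ghi"  # hypothetical bytes with two invalid spots
problems = find_undecodable(sample)
for offset, bad, surrounding in problems:
    print(offset, bad, surrounding)
```

On the real file you would read the contents in 'rb' mode and pass them in. The surrounding bytes often give a hint about which encoding the file really uses.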

AkshayRY · Answer 2 · 2020-11-16T18:44:42.560

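To read a file containing some special characters, you can pass encoding='latin1' to open(). Note that latin1 maps every byte to a character, so it never raises UnicodeDecodeError, but the result will contain wrong characters if the file is really in another encoding. 'unicode_escape' is sometimes suggested as well, but it is meant for text containing Python-style backslash escapes rather than as a general file encoding.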
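A short illustration of the trade-off with latin1, using a made-up string rather than the asker's data: the decode always succeeds, but UTF-8 encoded text comes out garbled.

```python
raw = "caf\u00e9".encode("utf-8")  # 'café' stored as UTF-8 bytes: b'caf\xc3\xa9'

as_utf8 = raw.decode("utf-8")     # correct: one character for the accented letter
as_latin1 = raw.decode("latin1")  # no error, but each byte becomes its own character

print(repr(as_utf8))
print(repr(as_latin1))
```

For a frequency count restricted to [a-z] words this may be acceptable, but words containing accented letters would not be counted correctly.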

edited Nov 16 '20 at 18:44

answered Nov 15 '20 at 16:59
