2

I call open(file, "r") and read some lines in Python. This gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)

If I specify 'utf-8' as the encoding, I get:

'utf8' codec can't decode bytes in position 28-29: invalid continuation byte

If I specify 'ISO-8859-1' instead, I get no errors, but a line is read like this:

2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out!  #fantasysurfer

As you can see, there are some extra characters, which probably come from emojis or something similar (these are tweets).

What is the best approach to clean these lines up?

I would like to remove all the extraneous elements, so that the strings contain only numbers, letters, and common symbols such as ?!>.;, and so on.

Note: I don't care about the HTML entities, since I replace those in another function. I am talking about the weird Argh� and Carnage� elements.

In general, these are causing issues with the encoding.

Alastair McCormack
OHHH

3 Answers

1

First, make sure you have declared the right encoding on the first line of the Python file.

# -*- coding: utf-8 -*-

Second, you can use the codecs library and specify the desired encoding:

import codecs
fich_in = codecs.open(filename,'r', encoding='utf-8')

Third, you can ignore all the invalid characters using:

TEXT.encode('utf-8', 'ignore').decode('utf-8')
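
A minimal sketch of the "ignore" idea applied at decode time, assuming Python 3 and a made-up sample byte string (the real file contents are not known). The `errors` argument controls what happens to bytes that are not valid UTF-8:

```python
# Hypothetical sample: UTF-8 text (including a 4-byte emoji) with one stray
# invalid byte (0xE9), similar in spirit to the lines in the question.
raw = b"Argh\xe9 Fantasy Surfer \xf0\x9f\x8f\x84 all out!"

# errors="ignore" silently drops bytes that cannot be decoded as UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, so the damaged spot stays visible.
replaced = raw.decode("utf-8", errors="replace")

print(ignored)
print(replaced)
```

In Python 3 the same `errors` value can be passed straight to the built-in `open`, for example `open(filename, encoding="utf-8", errors="ignore")`.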
henryr
0

 
# -*- coding: latin-1 -*-

could help.

-1

First, try decoding and then encoding:

u"text".decode('latin-1').encode('utf-8')

Or try opening the file with codecs:

import codecs
# "your coding" is a placeholder for the file's actual encoding
with codecs.open('file', encoding="your coding") as f:
    text = f.read()

Your problem is either that you are opening the file with the wrong encoding, or that you have misidentified the character encoding.
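
When the encoding is uncertain, one possible approach is to try a few candidates in order. This is only a sketch assuming Python 3; the helper name `decode_with_fallback`, the candidate list, and the sample bytes are all hypothetical, and a successful decode does not prove the guess was right:

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Unreachable with the default list, because latin-1 accepts every byte.
    raise ValueError("none of the candidate encodings could decode the data")

# Hypothetical sample: 0x85 is invalid as UTF-8 but is an ellipsis in cp1252.
text, used = decode_with_fallback(b"Carnage\x85 all out!")
print(used, text)
```

Putting `latin-1` last matters: it never raises, so any encoding listed after it would never be tried.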

Also, if your text is in ASCII, use:

'abc'.decode('ascii')

or

unicode('abc', 'ascii')
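
The calls above are Python 2 style. As a rough Python 3 sketch of reducing a line to ASCII only (the helper name `to_ascii` and the sample line are hypothetical), encoding with `errors="ignore"` drops every character outside ASCII:

```python
def to_ascii(text):
    """Return text with every non-ASCII character removed (hypothetical helper)."""
    return text.encode("ascii", errors="ignore").decode("ascii")

# Sample modelled on the line in the question; \ufffd is the replacement character.
line = "ready to try Argh\ufffd Fantasy Surfer Carnage\ufffd all out!  #fantasysurfer"
cleaned = to_ascii(line)
print(cleaned)
```

Note that this also strips legitimate accented letters, so it only fits data where ASCII is all that needs to survive.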
JRazor
  • This worked, I made it ascii, it removed those weird chars. – OHHH Jan 24 '16 at 01:01
  • 1
    `u"text".decode('latin-1')` that's mixed up. You encode *from* unicode and you decode *to* unicode. – spectras Jan 24 '16 at 01:20
  • I used that encoding just as an example. I don't know the original text, and I think the original poster will be able to choose. – JRazor Jan 24 '16 at 01:23