How to remove all conflicting characters between latin1 and utf-8 using Python?

Question

I call open(file, "r") and read some lines in Python. This gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)

If I add 'utf-8', I get:

'utf8' codec can't decode bytes in position 28-29: invalid continuation byte

If I add 'ISO-8859-1', I get no errors but a line is read like this:

2890 ready to try Arghï¿½ Fantasy Surfer Carnageï¿½ Dane, Marlon &amp; Nat C all out!  #fantasysurfer

As you can see, there are some extra characters, which probably come from emojis or something (these are tweets).

What is the best approach to clean these lines up?

I would like to remove all the extraneous elements... I would like the strings to have only numbers, letters, and common symbols ?!>.;, etc...

Note: I don't care about the html entities, since I replace those in another function. I am talking about the weird Arghï¿½ Carnageï¿½ elements.

In general, these are causing issues with the encoding.

Can't you find out what encoding has been used for the file originally? — Tom Dalton, Jan 24 '16 at 00:54
Your data actually is UTF-8. Can you provide the code you use to read it? And maybe copy-paste some data as well? — spectras, Jan 24 '16 at 01:29
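
One way to check the suggestion that the data is really UTF-8: the odd three-character sequence after `Argh` and `Carnage` in the sample line looks like what the UTF-8 bytes of U+FFFD (the Unicode replacement character) become when they are decoded as ISO-8859-1. The following is a minimal sketch, assuming Python 3; the variable names are only illustrative.

```python
# U+FFFD is the character decoders insert in place of bytes they could not decode.
replacement = "\ufffd"

# Its UTF-8 form is three bytes: EF BF BD.
utf8_bytes = replacement.encode("utf-8")

# Reading those three bytes as ISO-8859-1 (latin-1) yields three separate characters.
mojibake = utf8_bytes.decode("latin-1")

# Reversing the wrong decode recovers the single replacement character.
restored = mojibake.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes), repr(mojibake), repr(restored))
```

If that is what happened here, the original characters were probably already lost before the file was written, so no choice of encoding at read time would bring them back.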

score 1 · Answer 1 · answered Jan 24 '16 at 01:11

First, make sure you specified the right encoding on the first line of the Python file.

# -*- coding: utf-8 -*-

Second, you can use the codecs module, specifying the desired encoding:

import codecs
fich_in = codecs.open(filename,'r', encoding='utf-8')

Third, you can ignore all the invalid characters using:

TEXT.encode('utf-8', 'ignore').decode('utf-8')
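
For comparison, here is a minimal sketch of the same idea applied while decoding, assuming Python 3 and that the data is available as raw bytes. The sample bytes and the two helper names are hypothetical; `errors="ignore"` and `errors="replace"` are standard error handlers for `bytes.decode`.

```python
# Hypothetical sample: UTF-8 text with one stray invalid byte (0xE9) after "Argh".
raw = b"Argh\xe9 Fantasy Surfer \xf0\x9f\x8f\x84"


def decode_dropping_invalid(data: bytes) -> str:
    """Decode as UTF-8 and silently drop any bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")


def decode_marking_invalid(data: bytes) -> str:
    """Decode as UTF-8 and put U+FFFD wherever invalid bytes were found."""
    return data.decode("utf-8", errors="replace")


print(decode_dropping_invalid(raw))
print(decode_marking_invalid(raw))
```

Dropping bytes is lossy, so the `replace` variant may be easier to audit: every damaged spot stays visible as U+FFFD and can be removed later.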

Gonzalo Roncedo · Answer 2 · 2023-07-04T15:55:59.173

0

 
# -*- coding: latin-1 -*-

could help.

edited Jul 04 '23 at 15:55

answered Jul 04 '23 at 15:51

Gonzalo Roncedo


JRazor · Accepted Answer · 2016-01-24T01:13:14.073

-1

First try decoding, then encoding:

u"text".decode('latin-1').encode('utf-8')

Or try opening the file with codecs:

import codecs
with codecs.open('file', encoding="your coding") as f:
    text = f.read()

Your problem is either that you are opening the file with the wrong encoding, or that you have misidentified the character encoding.
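
When the encoding is unknown, one option is to try a few candidates in order. This is only a sketch, assuming Python 3; `read_with_fallback` and its default candidate list are hypothetical, not part of any library.

```python
def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate does not fit the bytes, try the next one
    # Only reachable with a custom candidate list, because latin-1 accepts any byte.
    raise ValueError("none of the candidate encodings could decode %r" % path)
```

A clean decode does not prove the guess is right: latin-1 accepts every byte sequence, so it only makes sense as the last resort, and the result should still be checked by eye.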

Also, if you get text in ASCII, use this:

'abc'.decode('ascii')

or

unicode('abc', 'ascii')
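
The two calls above are Python 2 only. If the goal is to keep nothing but plain ASCII, a rough Python 3 sketch could look like the following; `to_ascii` is a hypothetical helper name and the sample string is made up to resemble the tweet in the question.

```python
def to_ascii(text: str) -> str:
    """Drop every character that has no ASCII representation."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# Made-up sample with two U+FFFD replacement characters in it.
sample = "Argh\ufffd Fantasy Surfer Carnage\ufffd all out!"
print(to_ascii(sample))
```

This also removes legitimate accented letters (for example, "café" becomes "caf"), so it only fits when pure ASCII output is acceptable.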

edited Jan 24 '16 at 01:13

answered Jan 24 '16 at 01:00

JRazor

This worked, I made it ascii, it removed those weird chars. – OHHH Jan 24 '16 at 01:01
1

`u"text".decode('latin-1')` that's mixed up. You encode *from* unicode and you decode *to* unicode. – spectras Jan 24 '16 at 01:20
I left the encoding in just as an example. I don't know the original text, and I think the original poster will be able to choose. – JRazor Jan 24 '16 at 01:23
