I call open(file, "r") and read some lines in Python. This gives me:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)
If I specify 'utf-8' as the encoding, I get:
'utf8' codec can't decode bytes in position 28-29: invalid continuation byte
If I specify 'ISO-8859-1' as the encoding, I get no errors, but a line is read like this:
2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out! #fantasysurfer
As you can see, there are some extra characters, which probably come from emoji or something similar. (These are tweets.)
What is the best approach to clean these lines up?
I would like to remove all the extraneous elements, so that the strings contain only numbers, letters, and common symbols such as ?!>.;, and so on.
Note: I don't care about the HTML entities, since I replace those in another function. I am talking about the weird Argh� Carnage� elements.
In general, these are causing issues with the encoding.
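To make the goal concrete, here is a minimal sketch of the kind of cleanup I have in mind. It assumes Python 3, that the file is mostly UTF-8 with some invalid bytes, and that "common symbols" means ASCII punctuation. The names clean_line, read_clean_lines and _ALLOWED are hypothetical, made up for this illustration.

```python
import re
import string

# Hypothetical whitelist: ASCII letters, digits, punctuation and whitespace.
_ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t\n")


def clean_line(raw: bytes) -> str:
    """Decode one raw line leniently and keep only whitelisted characters."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # Drop everything outside the whitelist (emoji, U+FFFD, accented letters).
    kept = "".join(ch for ch in text if ch in _ALLOWED)
    # Collapse runs of spaces left behind by removed characters.
    return re.sub(r" {2,}", " ", kept).strip()


def read_clean_lines(path):
    """Yield cleaned lines from a file opened in binary mode."""
    with open(path, "rb") as fh:
        for raw in fh:
            yield clean_line(raw)
```

Opening the file in binary mode avoids the decode errors at read time, and the lenient decode plus whitelist removes whatever cannot be represented. Note that this sketch also drops legitimate non-ASCII letters such as é, which may or may not be acceptable.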