
I am trying to import data from a database encoded in "latin1", convert it to Unicode, and import it into my app. Normally this is no problem. But now I have some new data where one field contains a strange character: "\x17".

How do I deal with this in Python?

For now I have written a function that replaces such characters, but I think there must be much better ways than this:

def replace_problem_characters(self, text):
    replace_store = {u"\x17" : ""}
    for key, value in replace_store.items():
        if key in text:
            text = text.replace(key, value)
    return text
oxidworks
  • [Check this out](http://stackoverflow.com/questions/2672326/what-does-a-leading-x-mean-in-a-python-string-xaa) are you sure you don't need that data? – Priyank Dec 15 '16 at 12:58
  • In this case, yes. It is a person's name, and I can also see it displayed correctly in the web interface. It comes from a Lithuanian person. Maybe he copied and pasted it from a text document with a local encoding? – oxidworks Dec 15 '16 at 14:03
  • @oxidworks \x17 is a control character, present in most encodings, including ascii. Copy and paste not likely. Perhaps his IME allows input of control characters ... – John Machin Dec 18 '16 at 05:01

1 Answer


If the database is encoded in "latin", why are you messing with utf-8? Note that in line 4 of your code snippet text is presumed to be encoded in latin but in line 5 the fixed record ends up encoded in utf-8.

When accessing text columns in your database:

1. If not done for you, immediately decode from latin into Unicode.
2. Process your text using Unicode methods.
3. If not done for you, encode your Unicode text into latin just before updating the database.
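A minimal sketch of those three steps in Python 3, assuming the database driver hands back raw bytes. The value of `raw` is a made-up example, not data from the question:

```python
# Hypothetical raw value, as a driver might return it for a latin-1 column.
raw = b"J\xfcrgen\x17"

# 1. Decode the bytes into Unicode text as early as possible.
text = raw.decode("latin-1")

# 2. Do all processing on Unicode text (here: drop the stray control character).
text = text.replace("\x17", "")

# 3. Encode back to latin-1 only at the last moment, just before writing.
stored = text.encode("latin-1")
```

If the driver already returns Unicode strings and accepts them on write, steps 1 and 3 are done for you and only step 2 remains.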

For data like names, you are highly likely not to want any of the 32 C0 controls (\x00 up to \x1f).
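One possible way to drop all of them in a single pass is `str.translate` with a table that maps each C0 code point to `None`. The helper name `strip_c0_controls` is only illustrative, and this sketch assumes Python 3 (or a `unicode` object in 2.7):

```python
# Map every C0 control (U+0000 to U+001F) to None so translate() deletes it.
C0_CONTROLS = dict.fromkeys(range(0x20))

def strip_c0_controls(text):
    """Return text with all C0 control characters removed."""
    return text.translate(C0_CONTROLS)

cleaned = strip_c0_controls("Jonas\x17 Petraitis")
```

Note that this also removes tab, carriage return and newline, which is usually fine for names but may not be for free-text fields.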

If your database is truly latin aka latin_1 aka ISO-8859-1, you don't want the 32 C1 controls (\x80 up to \x9f) either. However, if you find these in your database, it is likely that you should have been using cp1252 or similar, which treats most of \x80 up to \x9f as printable characters (more accented letters and punctuation).
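A small sketch of how the two codecs differ on that byte range, and of one way to check decoded text for C1 controls. The sample bytes are invented for illustration:

```python
# Hypothetical bytes containing values in the 0x80-0x9f range.
raw = b"\x93quoted\x94 \x80"

# latin-1 maps every byte to the code point of the same value,
# so these bytes become invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# cp1252 maps the same bytes to curly quotes and the euro sign.
as_cp1252 = raw.decode("cp1252")

# If latin-1 decoded text contains C1 controls, the data was probably cp1252.
has_c1 = any(0x80 <= ord(ch) <= 0x9F for ch in as_latin1)
```

A few bytes in that range (for example \x81) are undefined in cp1252, so decoding with it can raise `UnicodeDecodeError` on data that really is latin-1.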

And in any case it would be a lot better if the database was encoded in utf-8, and if you could use Python 3.x instead of 2.7.

John Machin
  • Thank you. I am now converting from latin1 to Unicode directly after reading from the database. The question has been edited. Changing the database to "utf8" has also been planned for a while, but time has been running very fast these last years :) – oxidworks Dec 18 '16 at 17:11