Python: efficient mass replacement of unknown characters

Question

A PHP4 + MySQL4 based project posts data to a Django 1.1 project, and some letters get mixed up along the way.
What is the best (most efficient) way to do this kind of replacement?
My problem is that I cannot get the values for those letters. Is there an online tool for that?

I have a text field containing various letters, and I want to replace them like this:

àèæëáðøûþ => ąčęėįšųūž
ÀÈÆËÁÐØÛÞ => ĄČĘĖĮŠŲŪŽ

I had a similar case where I had to clean up the code, so I used this:

def clean(string):
    # Drop control characters below 32, except tab (9), line feed (10) and carriage return (13).
    return ''.join(c for c in string if ord(c) > 31 or ord(c) in (9, 10, 13))

Update: I managed to extract the Unicode values by looking at Django debug messages (replace_from: replace_to):

{'\xe0':'\u0105', '\xe8':'\u010d', '\xe6':'\u0119', '\xeb':'\u0117', '\xe1':'\u012f',
 '\xf0':'\u0161', '\xf8':'\u0173', '\xfb':'\u016b', '\xfe':'\u017e',
 '\xc0':'\u0104', '\xc8':'\u010c', '\xc6':'\u0118', '\xcb':'\u0116', '\xc1':'\u012e',
 '\xd0':'\u0160', '\xd8':'\u0172', '\xdb':'\u016a', '\xde':'\u017d'}

So the main problem remains: doing the replacement.

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

Try the str.replace() method; it should work with unicode strings.

str.replace(old, new[, count])

Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

Make sure your old and new strings are of type Unicode (that applies to your input data as well).
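As a rough illustration of the point above, here is a minimal sketch that applies one replace() call per pair, with every operand a unicode string. The names REPLACEMENTS and replace_all are hypothetical, and the mapping is only a subset of the pairs listed in the question.

```python
# Hypothetical mapping: a subset of the question's pairs, all unicode literals.
REPLACEMENTS = {
    u'\xe0': u'\u0105',
    u'\xe8': u'\u010d',
    u'\xf0': u'\u0161',
}


def replace_all(text, mapping):
    """Apply each old -> new pair with one replace() call per pair."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


# Compare against the expected result instead of printing non-ASCII text.
print(replace_all(u'\xe0\xe8\xf0 abc', REPLACEMENTS) == u'\u0105\u010d\u0161 abc')
```

This scans the text once per pair, so it is simple rather than fast. It is only order-independent here because no replacement character is also a key of the mapping.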

Find out what your input (non-unicode) string is supposed to be encoded in. For example, it may be in latin1 encoding. Use the builtin str.decode() method to create a Unicode version of your data, and feed that to str.replace().

>>> unioldchars = oldchars.decode("latin1")
>>> newdata = data.replace(unioldchars, newchars)
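If the text was simply decoded with the wrong codec, re-decoding may make a replacement table unnecessary. The sketch below assumes the original bytes were Windows-1257 (Baltic) that got read as latin1; that is a guess based on the pairs in the question, not something the thread confirms. The helper name redecode is hypothetical.

```python
def redecode(text, wrong='latin1', right='cp1257'):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)


# The garbled lowercase letters from the question, written as escapes.
garbled = u'\xe0\xe8\xe6\xeb\xe1\xf0\xf8\xfb\xfe'
fixed = redecode(garbled)
print(fixed == u'\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u017e')
```

Note that encode() raises UnicodeEncodeError if the text contains characters outside latin1, for example text that is already correct, so this only suits data known to be garbled in this particular way.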

answered Jun 08 '11 at 15:25

gimel

This throws a similar error for me, `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)`, since the first char is in neither the UTF-8 (which I use) nor the ASCII encoding. – JackLeo Jun 08 '11 at 16:30
Make sure you are working on UNICODE strings, e.g. u'\u00e0\u00e8'. – gimel Jun 09 '11 at 05:04
By accident I ran into this: https://github.com/Kitto/python-osm/issues/1 All I needed was to force UTF-8 when launching the project, since Python expects to get ASCII most of the time. – JackLeo Jun 09 '11 at 10:35

score 0 · Answer 2 · answered Jun 08 '11 at 15:22

I'd do it myself. The built-in replace function is of little use if you want multiple, efficient replacements.

Give this a look: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/
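For readers who cannot follow the link, here is a minimal sketch of the single-pass idea using the re module. It is written in the spirit of that approach and is not a copy of the linked recipe; the function name multiple_replace and the sample mapping are assumptions made for illustration.

```python
import re


def multiple_replace(text, mapping):
    """Replace every key of mapping found in text in one regex pass."""
    if not mapping:
        return text
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(u'|'.join(re.escape(k) for k in keys))
    # Each match is looked up in the mapping to find its replacement.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


mapping = {u'\xe0': u'\u0105', u'\xc0': u'\u0104', u'\xfe': u'\u017e'}
print(multiple_replace(u'\xc0\xe0 x \xfe', mapping) == u'\u0104\u0105 x \u017e')
```

Because the text is scanned only once, a replacement is never itself replaced again, which can happen with chained replace() calls.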

EDIT: WAIT, you wanted to do the replacement client-side, like in the text-box?

salezica

No, in the application logic, before the .save() method. – JackLeo Jun 08 '11 at 15:26

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

string.translate(s, table[, deletechars])

Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.

See also http://docs.python.org/library/string.html#string.maketrans
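The 256-character table quoted above applies to byte strings. For Unicode text, the translate() method of unicode objects (str in Python 3) takes a mapping keyed by code point instead. A minimal sketch, assuming the input is already a unicode string; the names PAIRS, TABLE and fix_letters are hypothetical and the mapping is a subset of the question's pairs.

```python
# Subset of the question's pairs, written as unicode escapes.
PAIRS = {
    u'\xe0': u'\u0105', u'\xe8': u'\u010d', u'\xfe': u'\u017e',
    u'\xc0': u'\u0104', u'\xc8': u'\u010c', u'\xde': u'\u017d',
}

# translate() expects integer code points as keys, hence ord().
TABLE = dict((ord(old), new) for old, new in PAIRS.items())


def fix_letters(text):
    """Translate every mapped character in a single pass over text."""
    return text.translate(TABLE)


print(fix_letters(u'\xc0\xe0 \xc8\xe8 \xde\xfe') == u'\u0104\u0105 \u010c\u010d \u017d\u017e')
```

Characters missing from the table pass through unchanged, so the same call is safe on text that is only partly garbled.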
