python special character decoding/encoding

Question

How can I translate the following string

H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}

into

H. P. Dembinski, B. K\xe9gl, I. C. Mari\u015f, M. Roth, D. Veberi\u010d

?

Functino · Answer 1 · 2015-09-16T03:53:08.463

This code handles the patterns in your example. It is general enough that you can support the remaining accent codes by adding them to the table.

#!/usr/bin/python3
import re, unicodedata, sys

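# LaTeX accent command -> Unicode combining character
table = {
        'v': '\u030C',  # caron
        'c': '\u0327',  # cedilla
        "'": '\u0301'   # acute accent
        # etc...
        }

def despecial(s):
    # Rewrite \X{letters} as the letters followed by the combining character for X.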
    return re.sub(r"\\(" + '|'.join(map(re.escape, table)) + r")\{(\w+)}",
            lambda m: m.group(2) + table[m.group(1)],
            s)

if __name__ == '__main__':
    print(unicodedata.normalize('NFC', despecial(' '.join(sys.argv[1:]))))

Example:

>>> despecial(r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}")
'H.P. Dembinski, B. Kégl, I.C. Mariş, M. Roth, D. Veberič'

Example (command line):

$ ./path/to/script.py "Hello W\v{o}rld"
Hello Wǒrld

It appends the appropriate Unicode combining character to the argument of each accent command: U+0301 COMBINING ACUTE ACCENT, U+0327 COMBINING CEDILLA, and U+030C COMBINING CARON. If you want precomposed characters instead, normalize the result to NFC with unicodedata.normalize:

>>> import unicodedata
>>> unicodedata.normalize('NFC', despecial(r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}"))
'H.P. Dembinski, B. Kégl, I.C. Mariş, M. Roth, D. Veberič'
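To make the effect of the normalization step visible, here is a small sketch (the variable names are only illustrative) that compares a decomposed string, as despecial would emit it, with its NFC-composed form:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT, i.e. the decomposed form
decomposed = "Ke\u0301gl"
# NFC merges the base letter and the combining mark into one code point
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))          # 5 code points: K, e, U+0301, g, l
print(len(composed))            # 4 code points: K, U+00E9, g, l
print(composed == "K\xe9gl")    # True
print(decomposed == composed)   # False, although both render the same
```

The two strings look identical when displayed but compare unequal, which is why normalizing matters if the result is later compared or searched.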

That said, I'm sure there's a better way to handle this. It looks like what you have is LaTeX code.

How can I automatically pass a raw argument, for example r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}", to despecial? — mmachine, Sep 16 '15 at 03:48
@mmachine: The last edit put a thing in so it can be run from the command line, if that's what you wanted. If you wanted python to automagically ignore escape sequences in your string without being told to do so, then you're out of luck. — Functino, Sep 16 '15 at 03:54
what about the case \"a or \'o... the solution above only detects the case \'{o}... I would have thought there is a pre-implemented conversion solution for that somewhere? Latex code is widely used isn't it? — carl, Sep 16 '15 at 05:12
@carl: http://stackoverflow.com/questions/530121/how-do-i-convert-latex-to-plain-text-ascii — Functino, Sep 16 '15 at 12:45
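For the brace-less forms raised in the comments (\"a, \'o), one possible extension is sketched below. It is only an illustration, not a complete LaTeX converter: the function name despecial_loose and the table COMBINING are hypothetical, the table is partial, and the sketch assumes that letter-named accents such as \c and \v always come with braces so that commands like \vec or \cite are left alone.

```python
import re
import unicodedata

# Accent command -> Unicode combining character (partial, illustrative table)
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
    "c": "\u0327",  # cedilla
    "v": "\u030C",  # caron
}

# Backslash, accent character, then one letter either braced or bare
PATTERN = re.compile(r"""\\(['`"^~cv])(?:\{([A-Za-z])\}|([A-Za-z]))""")

def despecial_loose(s):
    def repl(m):
        accent, braced, bare = m.groups()
        if bare is not None and accent.isalpha():
            # Something like \vec or \cite: not an accent, leave it unchanged
            return m.group(0)
        return (braced or bare) + COMBINING[accent]
    # Compose base letter and combining mark into a single character
    return unicodedata.normalize("NFC", PATTERN.sub(repl, s))

# ascii() shows the result as escapes, independent of the terminal encoding
print(ascii(despecial_loose(r"""B. K\'{e}gl, G\"odel, Mari\c{s}, \vec{x}""")))
```

A dedicated LaTeX-to-text tool is still likely to be the more robust choice for real documents.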

Mark Ransom · Answer 2 · 2015-09-16T03:27:59.797


>>> s = "H.P. Dembinski, B. K\\'{e}gl, I.C. Mari\\c{s}, M. Roth, D. Veberi\\v{c}"
>>> s.replace(u"\\'{e}", u"\xe9").replace(u"\\c{s}", u"\u015f").replace(u"\\v{c}", u"\u010d")
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

That of course is the brute-force method. As you say you'll have many possible replacements, here's another way that's still brute-force but cleaner:

>>> table = ((u"\\'{e}", u"\xe9"), (u"\\c{s}", u"\u015f"), (u"\\v{c}", u"\u010d"))
>>> new = s
>>> for pattern, ch in table:
...     new = new.replace(pattern, ch)
...
>>> new
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

Since the sequences to be replaced share a common pattern, you can also take advantage of regular expressions.

>>> import re
>>> split = re.split(u"(\\\\['a-z]{[a-z]})", s)
>>> table = {u"\\'{e}": u"\xe9", u"\\c{s}": u"\u015f", u"\\v{c}": u"\u010d"}
>>> ''.join(table[piece] if piece in table else piece for piece in split)
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'
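The same table lookup can also be written with re.sub and a callback instead of splitting and joining. The following is a minimal Python 3 sketch of that variant; the function name replace_known is only illustrative, and sequences that are missing from the table are deliberately left unchanged.

```python
import re

# Full escape sequence -> replacement character
table = {
    r"\'{e}": "\xe9",
    r"\c{s}": "\u015f",
    r"\v{c}": "\u010d",
}

# A backslash, one accent character, and a single braced lowercase letter
pattern = re.compile(r"\\['a-z]\{[a-z]\}")

def replace_known(s):
    # Look each match up in the table; unknown sequences stay as they are
    return pattern.sub(lambda m: table.get(m.group(0), m.group(0)), s)

s = r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}"
# ascii() prints the non-ASCII characters as escapes, as in the question
print(ascii(replace_known(s)))
```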

edited Sep 16 '15 at 03:27

answered Sep 16 '15 at 03:08


ah yes that certainly works, but I was wondering whether there is a general solution, since there are many more symbols which can appear (the string above is just an example). – carl Sep 16 '15 at 03:12
@carl there isn't an obvious pattern between the input and output, so it's going to be based on some kind of large table. I'm trying to work that out now. – Mark Ransom Sep 16 '15 at 03:13
thanks a lot Mark. Do you know whether there is a complete table of these symbols somewhere? I don't seem to find anything online? – carl Sep 16 '15 at 03:25
@carl I've never seen this kind of character abbreviation before, and since I don't know where the string originated I can't help you. – Mark Ransom Sep 16 '15 at 03:28
@carl the other answer recognized the pattern, and Google didn't take long to find a table: https://en.wikibooks.org/wiki/LaTeX/Special_Characters#Escaped_codes – Mark Ransom Sep 16 '15 at 03:30
