How can I translate the following string

H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}

into

H. P. Dembinski, B. K\xe9gl, I. C. Mari\u015f, M. Roth, D. Veberi\u010d

?

Yu Hao
carl

2 Answers

This code handles the patterns in your example, and it is general enough to cover the rest of those accent codes: just add them to the table.

#!/usr/bin/python3
import re, unicodedata, sys

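# Maps each LaTeX accent command to its Unicode combining character.
table = {
        'v': '\u030C',  # caron
        'c': '\u0327',  # cedilla
        "'": '\u0301'   # acute accent
        # etc...
        }

def despecial(s):
    # Rewrite \X{letters} as the letters followed by the combining character for X.
    return re.sub(r"\\(" + '|'.join(map(re.escape, table)) + r")\{(\w+)\}",
            lambda m: m.group(2) + table[m.group(1)],
            s)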

if __name__ == '__main__':
    print(unicodedata.normalize('NFC', despecial(' '.join(sys.argv[1:]))))

Example:

>>> despecial(r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}")
'H.P. Dembinski, B. Kégl, I.C. Mariş, M. Roth, D. Veberič'

Example (command line):

$ ./path/to/script.py "Hello W\v{o}rld"
Hello Wǒrld

It puts the appropriate Unicode combining character after the argument given. Specifically: U+0301 COMBINING ACUTE ACCENT, U+0327 COMBINING CEDILLA, and U+030C COMBINING CARON. If you want the result as precomposed characters, normalize it to NFC with unicodedata.normalize:

>>> import unicodedata
>>> unicodedata.normalize('NFC', despecial(r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}"))
'H.P. Dembinski, B. Kégl, I.C. Mariş, M. Roth, D. Veberič'

That said, I'm sure there's a better way to handle this. It looks like what you have is LaTeX code.
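
A minimal sketch of how the same table approach could be extended to a few more accent commands and to the brace-less form such as \"a or \'o. The extra table entries, the function name latex_accents_to_unicode, and the choice to accept the brace-less form only for non-letter commands are my own assumptions, not a complete LaTeX parser.

```python
import re
import unicodedata

# LaTeX accent command -> Unicode combining character (a partial table).
COMBINING = {
    "`": "\u0300",  # grave
    "'": "\u0301",  # acute
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
    '"': "\u0308",  # diaeresis
    "c": "\u0327",  # cedilla
    "v": "\u030C",  # caron
}

_ALL = "|".join(re.escape(k) for k in COMBINING)
# Only non-letter commands may take a bare argument: \cs would be a different command.
_SYMBOLS = "".join(re.escape(k) for k in COMBINING if not k.isalpha())

_BRACED = re.compile(r"\\(" + _ALL + r")\{(\w)\}")    # \'{e}, \c{s}, \v{c}
_BARE = re.compile(r"\\([" + _SYMBOLS + r"])(\w)")    # \'e, \"a


def latex_accents_to_unicode(s):
    """Replace the accent commands in COMBINING with precomposed characters."""
    def repl(m):
        # Base letter first, then the combining mark.
        return m.group(2) + COMBINING[m.group(1)]

    s = _BRACED.sub(repl, s)
    s = _BARE.sub(repl, s)
    # NFC merges each letter + combining mark into one code point where one exists.
    return unicodedata.normalize("NFC", s)


if __name__ == "__main__":
    print(latex_accents_to_unicode(r"B. K\'{e}gl, I.C. Mari\c{s}, D. Veberi\v{c}"))
```

Anything not covered by the table is left untouched, so unknown commands stay visible in the output instead of being silently dropped.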

Functino
  • How can I automatically pass a raw argument, for example r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}", to despecial? – mmachine Sep 16 '15 at 03:48
  • @mmachine: The last edit put a thing in so it can be run from the command line, if that's what you wanted. If you wanted Python to automagically ignore escape sequences in your string without being told to do so, then you're out of luck. – Functino Sep 16 '15 at 03:54
  • What about cases like \"a or \'o? The solution above only detects the form \'{o}. I would have thought there is a ready-made conversion for this somewhere; LaTeX code is widely used, isn't it? – carl Sep 16 '15 at 05:12
  • @carl: http://stackoverflow.com/questions/530121/how-do-i-convert-latex-to-plain-text-ascii – Functino Sep 16 '15 at 12:45
>>> s = "H.P. Dembinski, B. K\\'{e}gl, I.C. Mari\\c{s}, M. Roth, D. Veberi\\v{c}"
>>> s.replace(u"\\'{e}", u"\xe9").replace(u"\\c{s}", u"\u015f").replace(u"\\v{c}", u"\u010d")
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

That of course is the brute-force method. As you say you'll have many possible replacements, here's another way that's still brute-force but cleaner:

>>> table = ((u"\\'{e}", u"\xe9"), (u"\\c{s}", u"\u015f"), (u"\\v{c}", u"\u010d"))
>>> new = s
>>> for pattern, ch in table:
...     new = new.replace(pattern, ch)
...
>>> new
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

Since the sequences being replaced share a common pattern, you can also take advantage of regular expressions.

>>> import re
>>> split = re.split(u"(\\\\['a-z]{[a-z]})", s)
>>> table = {u"\\'{e}": u"\xe9", u"\\c{s}": u"\u015f", u"\\v{c}": u"\u010d"}
>>> ''.join(table[piece] if piece in table else piece for piece in split)
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'
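
As a variation on the split-and-join above, the lookup can also be done in a single pass with re.sub and a replacement function. This is only a sketch in Python 3 syntax; the names pattern and replace_all are illustrative, and the table is the same three-entry example rather than a complete list.

```python
import re

# Same example table as above: LaTeX sequence -> replacement character.
table = {"\\'{e}": "\xe9", "\\c{s}": "\u015f", "\\v{c}": "\u010d"}

# One alternation built from the escaped keys, so only known sequences match.
pattern = re.compile("|".join(re.escape(key) for key in table))


def replace_all(s):
    """Replace every sequence listed in table with its Unicode character."""
    return pattern.sub(lambda m: table[m.group(0)], s)


if __name__ == "__main__":
    s = "H.P. Dembinski, B. K\\'{e}gl, I.C. Mari\\c{s}, M. Roth, D. Veberi\\v{c}"
    print(replace_all(s))
```

Because the pattern is generated from the table keys, adding an entry to the table is the only change needed to support another sequence.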
Mark Ransom
  • ah yes that certainly works, but I was wondering whether there is a general solution, since there are many more symbols which can appear (the string above is just an example). – carl Sep 16 '15 at 03:12
  • @carl there isn't an obvious pattern between the input and output, so it's going to be based on some kind of large table. I'm trying to work that out now. – Mark Ransom Sep 16 '15 at 03:13
  • thanks a lot Mark. Do you know whether there is a complete table of these symbols somewhere? I don't seem to find anything online? – carl Sep 16 '15 at 03:25
  • @carl I've never seen this kind of character abbreviation before, and since I don't know where the string originated I can't help you. – Mark Ransom Sep 16 '15 at 03:28
  • @carl the other answer recognized the pattern, and Google didn't take long to find a table: https://en.wikibooks.org/wiki/LaTeX/Special_Characters#Escaped_codes – Mark Ransom Sep 16 '15 at 03:30