Regular expressions: re.sub(), \b and Cyrillic characters

Question

I'm trying to replace whole-word occurrences of a Cyrillic word in a text:

# -*- coding: utf-8 -*-
import re
S = u"раз Два трИ".lower()
print re.sub(ur"\bдва\b", u"четыре", S, re.U)

Prints "раз два три" while "раз четыре три" is expected.

At the same time, search() and findall() work well:

print re.search(ur"\bдва\b", S, re.U).group(0)
print re.findall(ur"\bдва\b", S, re.U)

So the problem is only with re.sub().

Latin chars work well:

S = u"one Two threE".lower()
print re.sub(ur"\btwo\b", u"four", S, re.U)

If I try the following instead, it swallows the surrounding spaces (and looks ugly):

print re.sub(u"[^а-яё\d]два[^а-яё\d]", u"четыре", S)

An attempt to keep the spaces doesn't work either:

print re.sub(u"(?:[^а-яё\d])(два)(?:[^а-яё\d])", u"четыре", S)

replace() doesn't help either:

S = u"раз Два трИ".lower()
print S
S.replace(u"два", u"четыре")
print S

Prints "раз два три" two times.

score 1 · Accepted Answer · answered Mar 05 '14 at 05:17

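You should pass the flags through the keyword argument flags. The fourth positional parameter of re.sub() is count, so re.U was being taken as a replacement limit rather than as a flag: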

In [3]: S = u"раз Два трИ".lower()
In [5]: print re.sub(ur"\bдва\b", u"четыре", S, flags=re.U)
раз четыре три
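
The effect can be reproduced in isolation. This is a minimal sketch, assuming Python 3, where str patterns are already Unicode-aware, so re.IGNORECASE stands in for re.U here and the variable names are only illustrative. The first call spells out what a positionally passed flag amounts to (its integer value lands in count), the second passes it by keyword:

```python
import re

text = "раз Два трИ"

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# A flag given as the fourth positional argument ends up in `count`;
# this call writes that out explicitly, so no flag is applied at all.
as_count = re.sub(r"\bдва\b", "четыре", text, count=int(re.IGNORECASE))

# Passed by keyword, the flag really changes how the pattern matches.
as_flags = re.sub(r"\bдва\b", "четыре", text, flags=re.IGNORECASE)

print(as_count)  # раз Два трИ
print(as_flags)  # раз четыре трИ
```

search() and findall() were not affected because their third positional parameter is flags. The replace() attempt in the question prints the same text twice for an unrelated reason: strings are immutable, so S.replace(...) returns a new string that has to be assigned, e.g. S = S.replace(u"два", u"четыре").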

answered Mar 05 '14 at 05:17

Umair Khan
