python check if utf-8 string is uppercase

Question

I am having trouble with .isupper() when I have a utf-8 encoded string. I have a lot of text files I am converting to xml. While the text is very variable, the format is static. Words in all caps should be wrapped in <title> tags and everything else in <p>. It is considerably more complex than this, but this should be sufficient for my question.

My problem is that this is a utf-8 file. This is a must, as there will be ~~some~~ many non-English characters in the final output. This may be a good time to provide a brief example:

inputText.txt

RÉSUMÉ

Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.

DesiredOutput

    <title>RÉSUMÉ</title>
    <p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud
       aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick
       ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage.
       Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone
       sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison
       mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
   </p>

Sample Code

    #!/usr/local/bin/python2.7
    # yes this is an alt-install of python

    import codecs
    import sys
    import re
    from xml.dom.minidom import Document

    def main():
        fn = sys.argv[1]
        input = codecs.open(fn, 'r', 'utf-8')
        output = codecs.open('desiredOut.xml', 'w', 'utf-8')
        doc = Document()
        doc = parseInput(input,doc)
        print>>output, doc.toprettyxml(indent='  ',encoding='UTF-8')

    def parseInput(input, doc):
        tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines

        for i in range(len(tokens)):
            # THIS IS MY PROBLEM. .isupper() is never true.
            if str(tokens[i]).isupper(): 
                title = doc.createElement('title')
                tText = str(tokens[i]).strip('[\']')
                titleText = doc.createTextNode(tText.title())
                doc.appendChild(title)
                title.appendChild(titleText)
            else:
                p = doc.createElement('p')
                pText = str(tokens[i]).strip('[\']')
                paraText = doc.createTextNode(pText)
                doc.appendChild(p)
                p.appendChild(paraText)

        return doc

    if __name__ == '__main__':
        main()

Ultimately it is pretty straightforward. I would accept critiques or suggestions on my code. Who wouldn't? In particular I am unhappy with str(tokens[i]); perhaps there is a better way to loop through a list of strings?

But the purpose of this question is to figure out the most efficient way to check if an utf-8 string is capitalized. Perhaps I should look into crafting a regex for this.

Do note, I did not run this code and it may not run just right. I hand-picked the parts from working code and may have mistyped something. Alert me and I will correct it. Lastly, note I am not using lxml.

Is there a reason you're using `str()` rather than `unicode()`? — Velociraptors, Jun 17 '11 at 20:44
yes, in an older version of this code I was not opening this through codecs.open just open. Perhaps, I should use unicode(). in fact, h/o while I try that. — matchew, Jun 17 '11 at 20:46
well, it doesn't solve the problem. But perhaps I should use unicode() good catch. — matchew, Jun 17 '11 at 20:48
`isupper()` is locale-dependent for 8-bit strings; I thought that may have been part of the problem — Velociraptors, Jun 17 '11 at 20:51
The reason you’re having trouble is because ***Python is simply unsuitable for Unicode work!*** Behold: `python -c 'print u"\u216F\u216F\u216A".isupper()'` purports `False`, while `perl -E 'say "\x{216F}\x{216F}\x{216A}" =~ /^\p{upper}+$/ ? "True" : "False"'` correctly reports `True`. `python -c 'print u"\u01C8\u1FA9".istitle()'` gives an incorrect `False`, while `perl -E 'say "\x{01C8}\x{1FA9}" =~ /^\p{title}+$/ ? "True" : "False"'` correctly reports `True`. Python is fine for 7-bit ASCII, but as these tests show, if you expect to actually work in Unicode, you’ll need to upgrade to Perl. HTH! — tchrist, Jun 17 '11 at 22:16
@tchrist - according to this website, the Roman numeral characters are neither uppercase nor lowercase, making the False result in Python correct for isupper(): http://www.fileformat.info/info/unicode/char/216a/index.htm. I did not verify for istitle() — wberry, Jun 17 '11 at 22:42
@wberry: That's simply wrong. I've worked with this stuff for years! The capital Roman numbers have properties Nl, Alphabetic, Cased, Changes_When_Casefolded, Changes_When_Lowercased, Other_Uppercase, Uppercase, and many more. Ignore that lameass-site. Read The Unicode Standard™, currently version 6. The file `PropList.txt` from The Unicode Standard™ clearly states that `PropList.txt:2160..216F ; Other_Uppercase # Nl [16] ROMAN NUMERAL ONE..ROMAN NUMERAL ONE THOUSAND`. etc. — tchrist, Jun 18 '11 at 01:55
It is remarkable to see people uptick provably wrong comments. My statement is correct, as clearly dictated by The Unicode Standard™, which is **the** authoritative source in this matter. Python has simply gotten it wrong. The question is, when will this bug be fixed? — tchrist, Jun 20 '11 at 15:11
I'm not sure why Uppercase and Other_Uppercase are different classes in the Unicode standard. But by appearances, Perl may be assuming they are effectively both "uppercase" and Python may be testing strictly for membership in the Uppercase (not Other_Uppercase) class. Perhaps Python 3.x has an updated check now, it's been several years since this. — wberry, Nov 12 '15 at 00:41

John Machin · Accepted Answer · 2011-06-18T22:40:04.590

The primary reason that your published code fails (even with only ASCII characters!) is that, in Python 2.7, re.split() will not split on a zero-width match, and r'\b' matches zero characters (Python 3.7 and later do split on empty matches):

    >>> re.split(r'\b', 'foo-BAR_baz')
    ['foo-BAR_baz']
    >>> re.split(r'\W+', 'foo-BAR_baz')
    ['foo', 'BAR_baz']
    >>> re.split(r'[\W_]+', 'foo-BAR_baz')
    ['foo', 'BAR', 'baz']

Also, you need flags=re.UNICODE to ensure that the Unicode definitions of \b, \W, etc. are used. And using str() where you did is at best unnecessary.
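
As a rough illustration of the point about Unicode-aware word characters, the sketch below splits a line containing accented capitals. It assumes Python 3, where re.UNICODE is already the default for text patterns and re.ASCII is available for comparison; the sample string is made up for the example.

```python
import re

line = u'R\xc9SUM\xc9-foo_BAR'   # "RESUME" with accents, plus two more words

# Split on runs of non-word characters or underscores, using the
# Unicode definition of \W (accented letters count as word characters).
words = re.split(r'[\W_]+', line, flags=re.UNICODE)

# For contrast (Python 3 only): with ASCII-only \W the accented
# letters are treated as separators and the word falls apart.
ascii_words = re.split(r'[\W_]+', line, flags=re.ASCII)

print(words)
print(ascii_words)
```

With the Unicode definition the accented word stays in one piece, which is what the uppercase test needs to see.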

So it wasn't really a Unicode problem per se at all. However some answerers tried to address it as a Unicode problem, with varying degrees of success ... here's my take on the Unicode problem:

The general solution to this kind of problem is to follow the standard bog-simple advice that applies to all text problems: Decode your input from bytestrings to unicode strings as early as possible. Do all processing in unicode. Encode your output unicode into byte strings as late as possible.

So: byte_string.decode('utf8').isupper() is the way to go. Hacks like byte_string.decode('ascii', 'ignore').isupper() are to be avoided; they can be all of (complicated, unneeded, failure-prone) -- see below.
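
A minimal sketch of the decode-early, encode-late advice, assuming UTF-8 input and Python 3. The function name classify and the hand-built markup are hypothetical and only show where decoding and encoding sit; real output should go through an XML library so that the text is escaped.

```python
def classify(raw_line, encoding='utf-8'):
    """Wrap one line in <title> or <p>, taking and returning bytes (hypothetical helper)."""
    text = raw_line.decode(encoding)              # decode as early as possible
    tag = u'title' if text.isupper() else u'p'    # all processing happens on text
    markup = u'<%s>%s</%s>' % (tag, text, tag)    # no XML escaping: sketch only
    return markup.encode(encoding)                # encode as late as possible


heading = classify(u'R\xc9SUM\xc9'.encode('utf-8'))
body = classify(b'Bacon ipsum dolor sit amet')
print(heading.decode('utf-8'))
print(body.decode('utf-8'))
```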

Some code:

    # coding: ascii
    import unicodedata

    tests = (
        (u'\u041c\u041e\u0421\u041a\u0412\u0410', True), # capital of Russia, all uppercase
        (u'R\xc9SUM\xc9', True), # RESUME with accents
        (u'R\xe9sum\xe9', False), # Resume with accents
        (u'R\xe9SUM\xe9', False), # ReSUMe with accents
        )

    for ucode, expected in tests:
        print
        print 'unicode', repr(ucode)
        for uc in ucode:
            print 'U+%04X %s' % (ord(uc), unicodedata.name(uc))
        u8 = ucode.encode('utf8')
        print 'utf8', repr(u8)
        actual1 = u8.decode('utf8').isupper() # the natural way of doing it
        actual2 = u8.decode('ascii', 'ignore').isupper() # @jathanism
        print expected, actual1, actual2

Output from Python 2.7.1:

    unicode u'\u041c\u041e\u0421\u041a\u0412\u0410'
    U+041C CYRILLIC CAPITAL LETTER EM
    U+041E CYRILLIC CAPITAL LETTER O
    U+0421 CYRILLIC CAPITAL LETTER ES
    U+041A CYRILLIC CAPITAL LETTER KA
    U+0412 CYRILLIC CAPITAL LETTER VE
    U+0410 CYRILLIC CAPITAL LETTER A
    utf8 '\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90'
    True True False

    unicode u'R\xc9SUM\xc9'
    U+0052 LATIN CAPITAL LETTER R
    U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
    U+0053 LATIN CAPITAL LETTER S
    U+0055 LATIN CAPITAL LETTER U
    U+004D LATIN CAPITAL LETTER M
    U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
    utf8 'R\xc3\x89SUM\xc3\x89'
    True True True

    unicode u'R\xe9sum\xe9'
    U+0052 LATIN CAPITAL LETTER R
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    U+0073 LATIN SMALL LETTER S
    U+0075 LATIN SMALL LETTER U
    U+006D LATIN SMALL LETTER M
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 'R\xc3\xa9sum\xc3\xa9'
    False False False

    unicode u'R\xe9SUM\xe9'
    U+0052 LATIN CAPITAL LETTER R
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    U+0053 LATIN CAPITAL LETTER S
    U+0055 LATIN CAPITAL LETTER U
    U+004D LATIN CAPITAL LETTER M
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 'R\xc3\xa9SUM\xc3\xa9'
    False False True

The only differences with Python 3.x are syntactical -- the principle (do all processing in unicode) remains the same.
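
For reference, here is one possible Python 3 rendering of the comparison above. It is a sketch rather than part of the original test run; the results list is added only to make the outcome easy to inspect.

```python
tests = (
    ('\u041c\u041e\u0421\u041a\u0412\u0410', True),  # capital of Russia, all uppercase
    ('R\xc9SUM\xc9', True),                          # RESUME with accents
    ('R\xe9sum\xe9', False),                         # Resume with accents
    ('R\xe9SUM\xe9', False),                         # ReSUMe with accents
)

results = []
for text, expected in tests:
    u8 = text.encode('utf8')                          # bytes, as read from a file
    actual1 = u8.decode('utf8').isupper()             # decode properly, then test
    actual2 = u8.decode('ascii', 'ignore').isupper()  # the hack: non-ASCII bytes are dropped
    results.append((expected, actual1, actual2))
    print(expected, actual1, actual2)
```

The three columns should match the last line of each group in the Python 2.7.1 output: the properly decoded test agrees with the expected value every time, while the 'ignore' hack is wrong for the first and last cases.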

This answer, so far, has proved the most helpful to my problem. Thank you. — matchew, Jun 20 '11 at 16:01

wberry · Answer 2 · 2011-06-20T13:49:08.293

As one comment above illustrates, it is not true for every character that one of the checks islower() vs isupper() will always be true and the other false. Unified Han characters, for example, are considered "letters" but are not lowercase, not uppercase, and not titlecase.
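
A small sketch of that point, assuming Python 3: a Han character is a letter (general category Lo) for which both isupper() and islower() are false. The sample characters are chosen only for illustration.

```python
import unicodedata

samples = [u'A', u'a', u'\u4e2d']   # Latin capital, Latin small, a Han character

report = {}
for ch in samples:
    # Record (general category, isupper, islower) for each character.
    report[ch] = (unicodedata.category(ch), ch.isupper(), ch.islower())
    print('U+%04X' % ord(ch), report[ch])
```

So "not uppercase" and "lowercase" are different questions, and code that branches on isupper() puts such letters in the "everything else" bucket.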

So your stated requirements, to treat upper- and lower-case text differently, should be clarified. I will assume the distinction is between upper-case letters and all other characters. Perhaps this is splitting hairs, but you ARE talking about non-English text here.

First, I do recommend using Unicode strings (the unicode() built-in) exclusively for the string processing portions of your code. Discipline your mind to think of the "regular" strings as byte-strings, because that's exactly what they are. All string literals not written u"like this" are byte-strings.

This line of code then:

    tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']

would become:

    tokens = [re.split(u'\\b', unicode(line.strip(), 'UTF-8')) for line in input if line != '\n']

You would also test tokens[i].isupper() rather than str(tokens[i]).isupper(). Based on what you have posted, it seems likely that other portions of your code would need to be changed to work with character strings instead of byte-strings.
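
To make the byte-string versus character-string distinction concrete, here is a short sketch written for Python 3, where the two types are called bytes and str. The variable names are illustrative only.

```python
text = u'R\xc9SUM\xc9'        # character string: six characters
raw = text.encode('utf-8')    # byte string: each accented letter takes two bytes

print(len(text), len(raw))

# Decoding the bytes with the right codec gives back the original text.
round_trip = raw.decode('utf-8')
print(round_trip == text)
```

The lengths differ because the byte-string counts encoded bytes, not characters, which is one reason character-level tests belong on the decoded text.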

I won't be able to test this solution until I return to the office, but it seems like this may also be a viable solution. The solution I had posted worked. But this may function better. Thanks. — matchew, Jun 17 '11 at 23:33
-1 for 2 reasons: (1) `re.split(r'\b', ...)` doesn't work. (2) `unicode(blahblah)` relying on the default encoding being UTF-8 -- it's ascii on e.g. Windows boxes and in any case sysadmins can fiddle with site.py or whatever to change it. — John Machin, Jun 18 '11 at 22:48
(1) it seems to return the input string unchanged, so pointless, but I'm not sure "doesn't work" is justified (2) added encoding argument to unicode() built-in in my answer — wberry, Jun 20 '11 at 13:52

score 0 · Answer 3 · answered Jun 17 '11 at 21:24

Simple solution, I think.

    tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines

becomes

    tokens = [line.strip() for line in input if line != '\n']

Then I am able to go with no need for str() or unicode(), as far as I can tell:

    if tokens[i].isupper(): #do stuff

The word token and the re.split on word boundaries are a legacy of when I was messing with nltk earlier this week. But ultimately I am processing lines, not tokens/words. This may change, but for now this seems to work. I will leave this question open for now, in the hope of alternative solutions and comments.
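
Put together, the line-based approach could look roughly like the sketch below. It assumes Python 3 and lines that are already decoded text; the function name lines_to_xml and the root element name document are hypothetical (minidom allows only one top-level element per document, so some wrapper element is needed).

```python
from xml.dom.minidom import Document


def lines_to_xml(lines):
    """Build a DOM with <title> for all-uppercase lines and <p> for the rest."""
    doc = Document()
    root = doc.createElement('document')   # hypothetical wrapper element
    doc.appendChild(root)
    for line in lines:                      # iterate directly, no index needed
        text = line.strip()
        if not text:                        # skip blank lines
            continue
        tag = 'title' if text.isupper() else 'p'
        node = doc.createElement(tag)
        node.appendChild(doc.createTextNode(text))
        root.appendChild(node)
    return doc


sample = [u'R\xc9SUM\xc9\n', u'\n', u'Bacon ipsum dolor sit amet.\n']
xml_text = lines_to_xml(sample).documentElement.toxml()
print(xml_text)
```

Looping over the lines themselves removes the need for tokens[i] and for any str() or unicode() call inside the loop.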
