I am having trouble with .isupper() when I have a UTF-8 encoded string. I have a lot of text files I am converting to XML. While the text is very variable, the format is static: words in all caps should be wrapped in <title> tags and everything else in <p> tags. It is considerably more complex than this, but this should be sufficient for my question.
My problem is that this is a UTF-8 file. This is a must, as there will be many non-English characters in the final output. This may be a good time to provide a brief example:
inputText.txt
RÉSUMÉ
Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
DesiredOutput
<title>RÉSUMÉ</title>
<p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud
aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick
ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage.
Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone
sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison
mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
</p>
Sample Code
#!/usr/local/bin/python2.7
# yes this is an alt-install of python
import codecs
import sys
import re
from xml.dom.minidom import Document

def main():
    fn = sys.argv[1]
    # Both files are opened through codecs so reads and writes are UTF-8.
    input = codecs.open(fn, 'r', 'utf-8')
    output = codecs.open('desiredOut.xml', 'w', 'utf-8')
    doc = Document()
    doc = parseInput(input, doc)
    print>>output, doc.toprettyxml(indent=' ',encoding='UTF-8')

def parseInput(input, doc):
    tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines
    for i in range(len(tokens)):
        # THIS IS MY PROBLEM. .isupper() is never true.
        if str(tokens[i]).isupper():
            title = doc.createElement('title')
            tText = str(tokens[i]).strip('[\']')
            titleText = doc.createTextNode(tText.title())
            doc.appendChild(title)
            title.appendChild(titleText)
        else:
            p = doc.createElement('p')
            pText = str(tokens[i]).strip('[\']')
            paraText = doc.createTextNode(pText)
            doc.appendChild(p)
            p.appendChild(paraText)
    return doc

if __name__ == '__main__':
    main()
Ultimately it is pretty straightforward. I would accept critiques or suggestions on my code. Who wouldn't? In particular I am unhappy with str(tokens[i]); perhaps there is a better way to loop through a list of strings?
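For the looping part, here is a minimal sketch of what I think a more direct loop could look like. It assumes the file is read as decoded text (io.StringIO stands in for the real file here), and iter_nonblank_lines is a name I made up for illustration, not something from my working code:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io


def iter_nonblank_lines(stream):
    """Yield each non-blank line of a decoded text stream, stripped."""
    for line in stream:      # iterate directly, no range(len(...)) needed
        text = line.strip()
        if text:             # skips '\n' and whitespace-only lines
            yield text       # still unicode text, no str() round trip


# io.StringIO is a stand-in for io.open(fn, encoding='utf-8').
sample = io.StringIO("RÉSUMÉ\n\nBacon ipsum dolor sit amet\n")
lines = list(iter_nonblank_lines(sample))
```

Each item in lines is then a plain unicode string rather than a list of tokens, so there would be nothing to strip brackets and quotes from afterwards.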
But the purpose of this question is to figure out the most efficient way to check whether a UTF-8 string is all uppercase. Perhaps I should look into crafting a regex for this.
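To make the check concrete, here is a minimal sketch of the test I am after, assuming .isupper() is called on decoded unicode text rather than on the str() of a token list. is_all_caps and tag_for are hypothetical helper names, not part of my working code:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def is_all_caps(text):
    """True if text has at least one cased character and none are lowercase."""
    if isinstance(text, bytes):
        # Raw UTF-8 bytes: decode first so that É is treated as one letter.
        text = text.decode("utf-8")
    return text.isupper()


def tag_for(line):
    """Pick the element name for one line: 'title' or 'p'."""
    return "title" if is_all_caps(line) else "p"
```

With this, tag_for would be called once per non-blank line, and the result passed to doc.createElement. A line with no letters at all (digits only, for example) would fall through to 'p', which may or may not be what I want.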
Do note, I did not run this code and it may not run just right. I hand-picked the parts from working code and may have mistyped something. Alert me and I will correct it. Lastly, note that I am not using lxml.