I want to parse a UTF-8 XML file and sort its entries by a field. Sorting uses a custom alphabet (s1 in the source code). The history of the question is here: sorting of list containing utf-8 characters. I found how to sort XML here. The sorting itself works correctly; the problem is with elementtree, which I can't get to work on Python 3.

Here is the source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#import xml.etree.ElementTree as ET   # Python 2.5
import elementtree.ElementTree as ET
s1='aáàAâÂbBcCçÇdDeéEfFgGğĞhHiİîÎíīıIjJkKlLmMnNóoOöÖpPqQrRsSşŞtTuUûúÛüÜvVwWxXyYzZ'
s2='11111122334455666aabbccddeeeeeeffgghhiijjkklllllmmnnooppqqrrsssssttuuvvwwxxyy'
trans = str.maketrans(s1, s2)
def unikey(seq):
    return seq[0].translate(trans)
tree = ET.parse("tosort.xml")
container = tree.find("entries")
data = []
for elem in container:
    keyd = elem.findtext("k")
    data.append((keyd, elem))
print (data)
data.sort(key=unikey)
print (data)
container[:] = [item[-1] for item in data]
tree.write("sorted.xml", encoding="utf-8")

Here are instructions to import the elementtree module. When I import the module with import xml.etree.ElementTree as ET, I get this message:

Traceback (most recent call last):
  File "pcs.py", line 19, in <module>
    container[:] = [item[-1] for item in data]
  File "/usr/lib/python3.1/xml/etree/ElementTree.py", line 210, in __setitem__
    assert iselement(element)
AssertionError

When I import it with import elementtree.ElementTree as ET instead, I get this message:

Traceback (most recent call last):
  File "pcs.py", line 4, in <module>
    import elementtree.ElementTree as ET
  File "/usr/local/lib/python3.1/dist-packages/elementtree/ElementTree.py", line 794, in <module>
    _escape = re.compile(eval(r'u"[&<>\"\u0080-\uffff]+"'))
  File "<string>", line 1
    u"[&<>\"\u0080-\uffff]+"
                           ^
SyntaxError: invalid syntax

I use Python 3.1.3 (r313:86834, Nov 28 2010, 11:28:10). In Python 2.6, elementtree works without a problem.

Content of tosort.xml:

<xdxf>
<entries>
<ar><k>zaaaa</k>definition1</ar>
<ar><k>şaaaa</k>definition2</ar>
...
...
</entries>
</xdxf>
microspace
  • The first code block has indentation problems inside `for`, could you fix that to match the actual code you run? – Lev Levitsky Jun 22 '12 at 18:13
  • Also, I think the problem could be that `s2` still contains non-ASCII characters, and those mess up the sorting. – Lev Levitsky Jun 22 '12 at 18:42
  • Oh, sorry, I've fixed that. The second code with non-ASCII characters works well. I think there is something wrong with the input file encoding, but I can't figure it out. – microspace Jun 22 '12 at 20:05
  • I've managed to solve the sorting problem. Thank you, @Lev Levitsky. I removed all non-ASCII characters from the `s2` string. – microspace Jun 24 '12 at 09:06

2 Answers

It looks like you are importing two different modules: one in /usr/lib/python3.1 called xml.etree, and the other in /usr/local/lib/python3.1/dist-packages called elementtree. The latter seems broken to me. As for the former, try removing [:] in the line

 container[:] = [item[-1] for item in data]
Lev Levitsky
  • Removing `[:]` didn't help. This line of code is from the [example](http://effbot.org/zone/element-sort.htm). The module that raises the `AssertionError` seems to work on Python 2.6. Maybe someone could tell me how to do my string translation in Python 2.6? Thank you! – microspace Jun 24 '12 at 15:21
  • @microspace If it didn't help, can you show how the traceback looks without `[:]`? – Lev Levitsky Jun 24 '12 at 15:25
  • I've edited the question. I produced the traceback with the following command: `print (traceback.format_exc())`. Is that correct? I've never printed a traceback before... Without `[:]`, the sorted data just wasn't written to the file... – microspace Jun 24 '12 at 15:49
  • @microspace You said removing `[:]` didn't help. With `[:]` there was an error on that line, and the interpreter printed the traceback (ending with `AssertionError`). What happens when you remove `[:]`? You don't need to print the traceback manually. – Lev Levitsky Jun 24 '12 at 15:53
  • Without `[:]` the program finished without errors and the sorting was done, but **unsorted** data was written to the output file. – microspace Jun 24 '12 at 15:58
  • @microspace That is because your code didn't change anything in `tree`. – Lev Levitsky Jun 24 '12 at 16:01
  • Actually, I don't really understand how the tree-manipulating code from effbot.org works (I just added a sorting function to it), but the code worked with Python 2.6. Anyway, the solution to the question has been found; thank you for your help, @Lev Levitsky. I'll go and study how elementtree works. )) – microspace Jun 24 '12 at 16:17
  • @microspace Okay, `[:]` was important for that code to work (with it, you **would** have changed the tree, apparently, although this seems weird to me). Without it, you are just creating a new list, so nothing happens to `tree`. However, in Python 3, the elements returned by `find` somehow don't pass the `iselement` assertion; I don't know exactly why. – Lev Levitsky Jun 24 '12 at 16:28
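
The in-place reordering discussed in these comments can also be done without slice assignment, by emptying the container and appending its children back in sorted order. This avoids the `container[:] = ...` line that raised the `AssertionError` in the question's traceback. The following is only a minimal sketch, assuming the `<xdxf>/<entries>/<ar>` layout from the question; the inline `SAMPLE` document and the plain `findtext` sort key are illustrative stand-ins, not the asker's data or custom alphabet:

```python
import xml.etree.ElementTree as ET

# Illustrative sample document (an assumption, modelled on tosort.xml).
SAMPLE = (
    "<xdxf><entries>"
    "<ar><k>zaaaa</k>definition1</ar>"
    "<ar><k>baaaa</k>definition2</ar>"
    "<ar><k>maaaa</k>definition3</ar>"
    "</entries></xdxf>"
)

root = ET.fromstring(SAMPLE)
container = root.find("entries")

# Build a sorted list of the children, keyed by the text of their <k> element.
ordered = sorted(container, key=lambda elem: elem.findtext("k"))

# Remove every child from the container, then append them back in sorted order.
for elem in list(container):
    container.remove(elem)
for elem in ordered:
    container.append(elem)

keys = [elem.findtext("k") for elem in container]
print(keys)
```

Because the same `<entries>` element is reused, the surrounding document structure stays unchanged; a custom key function can be passed to `sorted` in place of the lambda.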

Don't judge me too harshly, but here is my version of a solution:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET  # in the standard library since Python 2.5
s1="áàaAâÂbBcCçÇdDeéEfFgGğĞhHiİîÎíīıIjJkKlLmMnNóoOöÖpPqQrRsSşŞtTuUûúÛüÜvVwWxXyYzZ"
s2="AAAAAABBCCCCDDEEEFFGGHHddeeeeeeffgghhiijjkklllllmmnnooppqqrrsssssttuuvvwwxxyy"
trans = str.maketrans(s1, s2)
def unikey(seq):
    return seq[0].translate(trans)
tree = ET.parse("tosort.xml")
container = tree.find("entries")
data = []
for elem in container:
    keyd = elem.findtext("k")
    data.append([keyd, elem])
data.sort(key=unikey)
root = tree.getroot()
for _, elem in data:
    root.append(elem)  # append the Element objects to the root in sorted order
# Remove the original, unsorted <entries> element; the sorted <ar> elements
# remain as direct children of the root.
root.remove(container)
tree.write("sorted.xml", encoding="utf-8")

The solution is a bit ugly, but it works... I don't know how long it will take to sort ~50 MB of XML data, but time is not important in my case. I've also changed the sorting pattern a bit, because the old one sorted incorrectly when words contained digits. On an Acer Extensa 5210, the sort took no more than 2 minutes.
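
The custom-alphabet ordering used here relies on translating each key into a string whose ordinary code-point order matches the desired alphabet. As a rough illustration, here is a small sketch with a shortened alphabet; `ALPHABET`, `RANKS`, `TRANS` and `custom_key` are hypothetical names chosen for this example and do not come from the scripts above:

```python
# Hypothetical shortened alphabet: the desired order is a, b, c, ç, s, ş, z.
ALPHABET = "abcçsşz"
# One replacement character per letter; their code points increase in the
# same order as the letters of ALPHABET.
RANKS = "ABCDEFG"
TRANS = str.maketrans(ALPHABET, RANKS)


def custom_key(word):
    """Translate a word so that plain string comparison follows ALPHABET."""
    return word.translate(TRANS)


words = ["za", "şa", "sa", "ça", "ca"]
plain_order = sorted(words)                   # ordinary code-point order
custom_order = sorted(words, key=custom_key)  # custom alphabet order
print(plain_order)
print(custom_order)
```

With plain sorting, `ç` and `ş` land after `z` because of their code points; with the translated key they sort next to `c` and `s`. Characters that are not in the table pass through `translate` unchanged, so the replacement characters should be chosen so that they do not collide with characters, such as digits, that can occur in the keys.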

microspace