I am using Google App Engine with Python and am trying to fetch a gzipped XML file and parse it with lxml's iterparse. I based the following code on the example from lxml.de:

import gzip, base64, StringIO
from lxml import etree
from google.appengine.ext import webapp
from google.appengine.api.urlfetch import fetch

class Catalog(webapp.RequestHandler):
    user = xxx
    password = yyy
    catalog = fetch('url',
                    headers={"Authorization":
                             "Basic %s" % base64.b64encode(user + ':' + password)}, deadline=600)
    items = etree.iterparse(StringIO.StringIO(catalog), tag='product')

    for _, element in items:
        print('%s -- %s' % (element.findtext('name'), element[1].text))
        element.clear()

When I run it, I get the following error:

for _, element in coupons:
File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml\lxml.etree.c:98565)
File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml\lxml.etree.c:99086)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml\lxml.etree.c:74791)
XMLSyntaxError: Specification mandate value for attribute object, line 1, column 53

What does this error mean? I am guessing that the XML file is malformed, but I don't know where to look for the problem. Any help would be appreciated!

Vincent
  • Didn't you say the xml was gzipped? Where are you decompressing it? Also, the error message says the error is on line 1, column 53. – Sebastian Kreft Feb 03 '13 at 19:06
  • As far as I am aware, lxml takes care of gzip automatically, which is why I am not decompressing it. I did read that the error is at line 1, column 53, but what does this mean? The products in the XML file have 21 tags each, so logically I would expect it to be the 11th column of the 3rd product, but nothing is wrong there. In general the file is pretty well formed, so I suspect the problem is something else. – Vincent Feb 04 '13 at 06:35
  • The line items = etree.iterparse(StringIO.StringIO(catalog), tag='product') should read items = etree.iterparse(StringIO.StringIO(catalog.content), tag='product'). – Sebastian Kreft Feb 04 '13 at 17:28
  • Thanks for your suggestion. I tried this, but it returns "XMLSyntaxError: Document is empty, line 1, column 1". I would give you the URL, but it contains credentials which I cannot share. What is the most common way to share a document like this gzip file on Stack Overflow? – Vincent Feb 05 '13 at 07:47
  • If you are getting a "Document is empty" error, it probably means that the fetch call is not returning the contents of the file. First make sure the return code is successful and the content is correct. See https://developers.google.com/appengine/docs/python/urlfetch/responseobjects for details. – Sebastian Kreft Feb 05 '13 at 17:50
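
The check suggested in the last comment could be sketched roughly as follows. This is only an illustration, not code from the thread: sniff_payload and check_response are hypothetical helper names, the sample bytes are made up, and it is written for Python 3 so that it runs outside App Engine. It assumes the values passed in would come from the fetch result's status_code and content attributes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_payload(data):
    """Classify raw response bytes as 'empty', 'gzip', 'xml' or 'unknown'."""
    if not data:
        return "empty"
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if data.lstrip()[:1] == b"<":
        return "xml"
    return "unknown"


def check_response(status_code, content):
    """Return the payload kind, or raise ValueError if the response looks unusable."""
    if status_code != 200:
        raise ValueError("unexpected HTTP status %d" % status_code)
    kind = sniff_payload(content)
    if kind in ("empty", "unknown"):
        raise ValueError("unexpected payload: %s" % kind)
    return kind


# Made-up stand-in for a real response body.
sample = gzip.compress(b"<catalog><product/></catalog>")
print(check_response(200, sample))
print(sniff_payload(b""))
```

If the payload is reported as gzip, it still has to be decompressed before the parser sees it. As an aside, and only as a guess that nothing in this thread confirms: in Python 2, StringIO.StringIO() calls str() on a non-string argument, so passing the response object itself instead of its content would hand the parser the object's repr (something like "<... object at 0x...>"), in which "object" looks like an attribute with no value.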

1 Answer

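The problem was solved by handling the fetch and gzip steps differently: the response content is now decompressed explicitly with gzip.GzipFile before it is passed to iterparse. I also activated asynchronous requests and switched to webapp2. With all of this in place it worked :) Here's the code: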

from google.appengine.api.urlfetch import fetch
import gzip, webapp2, base64, StringIO, datetime
from credentials import CJCredentials
from lxml import etree

class Catalog(webapp2.RequestHandler):
    def get(self):
        user = xxx
        password = yyy
        url = 'some_url'

        catalogResponse = fetch(url, headers={
            "Authorization": "Basic %s" % base64.b64encode(user + ':' + password)
        }, deadline=10000000)

        # The response body is gzip-compressed, so decompress it explicitly
        f = StringIO.StringIO(catalogResponse.content)
        c = gzip.GzipFile(fileobj=f)
        content = c.read()

        # Wrap the decompressed XML in a file-like object for iterparse
        xml = StringIO.StringIO(content)

        tree = etree.iterparse(xml, tag='product')

        for event, element in tree:
            # lxml elements have no .name attribute; read the <name> child instead
            print element.findtext('name')
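
For anyone who wants to try the decompress-then-iterparse idea without App Engine, here is a minimal sketch under a few assumptions. It is written for Python 3 and swaps in the standard library's xml.etree.ElementTree for lxml so that it runs with no extra installs. The stdlib iterparse has no tag= keyword, so the filter is done by hand. iter_products and the sample catalog are made up for illustration. Unlike the code above, it hands the GzipFile to the parser directly instead of reading everything into a second buffer first.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_products(gzipped_bytes, tag="product"):
    """Yield each <tag> element from a gzip-compressed XML document."""
    raw = io.BytesIO(gzipped_bytes)
    # GzipFile is file-like, so the parser can read decompressed data from it
    with gzip.GzipFile(fileobj=raw) as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag == tag:
                yield element
                # Free the element's children once the caller has used it
                element.clear()


# Made-up catalog, for illustration only.
sample = gzip.compress(
    b"<catalog>"
    b"<product><name>Widget</name></product>"
    b"<product><name>Gadget</name></product>"
    b"</catalog>"
)
names = [p.findtext("name") for p in iter_products(sample)]
print(names)
```

With lxml the manual tag check should not be needed, since etree.iterparse accepts tag='product' as used in the handler above.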
Vincent