
I am trying to process some data with lxml. It works fine on my development server, but on production the following code:

parser = etree.XMLParser(encoding='cp1251')

throws:

  File "parser.pxi", line 1288, in lxml.etree.XMLParser.__init__ (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:77726)
  File "parser.pxi", line 738, in lxml.etree._BaseParser.__init__ (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:73404)
LookupError: unknown encoding: 'cp1251'

I am using lxml 2.3, and GAE appears to support the same version. So why does this error occur?

Edit:

I passed different encodings to XMLParser, such as cp1252, ISO-8859-5 and ISO-8859-2, and it always threw the same error on GAE, while working on my local machine. These are popular encodings, and lxml on GAE should support them. I believe something is wrong with the lxml build on GAE.

I created an issue: http://code.google.com/p/googleappengine/issues/detail?id=7315

Edit2:

Full traceback:

unknown encoding: 'cp1251'
Traceback (most recent call last):
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
    return handler.dispatch()
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~my_cool_app_id/1.358126884781269352/main.py", line 29, in get
    parser = etree.XMLParser(encoding='cp1251')
  File "parser.pxi", line 1288, in lxml.etree.XMLParser.__init__ (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:77726)
  File "parser.pxi", line 738, in lxml.etree._BaseParser.__init__ (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:73404)
LookupError: unknown encoding: 'cp1251'
Maxim
  • This works for me on shell-27.appspot.com (once I take care not to let the parser get pickled). Are you using Python 2.7? Can you include the complete stacktrace? – Nick Johnson Apr 11 '12 at 04:19
  • Also, have you tried decoding the text before passing it to the XML parser? – Nick Johnson Apr 11 '12 at 04:19
  • @NickJohnson thanks for your comments! Yes, I am using Python 2.7. On shell-27.appspot.com I can't even import lxml (No module named lxml). I am able to decode the text with `my_xml_str.decode('cp1251')`, but this is suboptimal: before passing such a unicode string to `etree.XML()` for parsing, I have to manually remove the encoding declaration from the XML document itself. Secondly, the XML document is 5 MB in size, and this is slower than letting lxml decode the string on its own. I edited my question to include the full traceback. – Maxim Apr 11 '12 at 06:47

1 Answer


There seems to be an open bug about this behavior on OS X, where specifying encoding="cp1252" resulted in the error above. The comments there also mention other affected systems: https://bugs.launchpad.net/lxml/+bug/707396

Have you tried specifying other encoding types? (to see if it's just a problem with cp1252)
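
To narrow down which encodings a given build rejects, a small probe can try each name and record whether constructing a parser raises the `LookupError` seen in the traceback. This is only a sketch: `probe_encodings` and `make_parser` are hypothetical names, and the parser factory is passed in so the helper itself does not depend on lxml being importable.

```python
def probe_encodings(make_parser, encodings):
    """Map each encoding name to True if make_parser accepted it."""
    results = {}
    for name in encodings:
        try:
            make_parser(name)
        except LookupError:
            # Same exception type as in the reported traceback.
            results[name] = False
        else:
            results[name] = True
    return results


# With lxml available, the factory could be (assumption, not run here):
#   from lxml import etree
#   make_parser = lambda name: etree.XMLParser(encoding=name)
#   probe_encodings(make_parser, ['utf-8', 'cp1251', 'cp1252', 'ISO-8859-5'])
```

Running the same probe locally and on the deployed app would show whether only some encodings are missing or everything except UTF-8.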

Sologoub
  • Thank you for your reply. I am actually using cp1251, not cp1252... But I tested my code with the cp1252 encoding too, and it still throws the same error. On the other hand, it works if I specify the utf-8 encoding. It seems that something is wrong with the lxml build on the GAE servers, because it works perfectly on my local machine. – Maxim Apr 10 '12 at 07:48
  • Sorry about that. Can you provide the details of your environment? I'm still thinking it might be related to the bug reported. – Sologoub Apr 10 '12 at 16:09
  • Found this post that shows people successfully decoding cp1251 using lxml on GAE: http://stackoverflow.com/questions/9793086/python-unicode-behaviour-in-google-app-engine. So the problem seems to be centered around encoding only. – Sologoub Apr 10 '12 at 16:21
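
One possible workaround, based on two observations from the comments (Python's own `decode('cp1251')` works, and a utf-8 parser is accepted on GAE), is to transcode the document to UTF-8 before parsing and rewrite the encoding declaration rather than stripping it by hand. The following is a minimal sketch, not a confirmed fix: `recode_xml_to_utf8` is a hypothetical helper name, and the regex assumes a conventional XML declaration at the very start of the document.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration,
# e.g. <?xml version="1.0" encoding="windows-1251"?>
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def recode_xml_to_utf8(raw_bytes, source_encoding):
    """Return the document as UTF-8 bytes with its declaration updated."""
    # Decode with Python's codec machinery, which does know cp1251.
    text = raw_bytes.decode(source_encoding)
    # Point the declaration (if any) at utf-8 so it matches the new bytes.
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return text.encode('utf-8')
```

The returned bytes could then be parsed with `etree.XMLParser(encoding='utf-8')`. This adds a decode and encode pass over the input, so for a 5 MB document it trades some speed for not depending on the encodings compiled into the hosted lxml build.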