Questions tagged [lxml]

lxml is a full-featured, high-performance Python library for processing XML and HTML.

Questions that concern the lxml Python library should have this tag. Per the lxml website, "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt." The library's lxml.etree package is used for XML processing, and lxml.html handles HTML. lxml.html.soupparser uses BeautifulSoup to parse broken HTML, and lxml.html.html5parser uses html5lib, which implements the HTML5 parsing algorithm.

Links:

https://lxml.de/ - Contains API documentation and tutorials

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ - IBM developerWorks page on lxml

5412 questions
32
votes
5 answers

Equivalent to InnerHTML when using lxml.html to parse HTML

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed. I would like to know what the most sensible way in the library is to do the…
somewhatoff
  • 971
  • 1
  • 11
  • 25
32
votes
6 answers

Python pretty XML printer with lxml

After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True). I have the following XML:
prosseek
  • 182,215
  • 215
  • 566
  • 871
32
votes
4 answers

Setup.py: install lxml with Python2.6 on CentOS

I have installed Python 2.6.6 on CentOS 5.4, [@SC-055 lxml-2.3beta1]$ python Python 2.6.6 (r266:84292, Jan 4 2011, 09:49:55) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2 Type "help", "copyright", "credits" or "license" for more…
k99
  • 709
  • 2
  • 6
  • 13
32
votes
2 answers

BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

When using Beautiful Soup what is the difference between 'lxml' and "html.parser" and "html5lib"? When would you use one over the other and the benefits of each? When I used each they seemed to be interchangeable, but people here correct me that I…
duc hathaway
  • 417
  • 1
  • 4
  • 9
32
votes
6 answers

Best way for a beginner to learn screen scraping by Python

This might be one of those questions that are difficult to answer, but here goes: I don't consider myself a programmer, but I would like to :-) I've learned R, because I was sick and tired of SPSS, and because a friend introduced me to the language…
Andreas
  • 6,612
  • 14
  • 59
  • 69
32
votes
8 answers

How to install lxml on Windows

I'm trying to install lxml on my Windows 8.1 laptop with Python 3.4 and failing miserably. First off, I tried the simple and obvious solution: pip install lxml. However, this didn't work. Here's what it said: Downloading/unpacking lxml Running…
spelchekr
  • 933
  • 3
  • 11
  • 19
31
votes
1 answer

out of memory issue in installing packages on Ubuntu server

I am using an Ubuntu cloud server with a limited 512 MB of RAM and a 20 GB HDD. More than 450 MB of its RAM is already used by processes. I need to install a new package called lxml, which gets compiled using Cython during installation, and it's a very heavy process, so it…
Man8Blue
  • 1,187
  • 6
  • 21
  • 34
29
votes
1 answer

How can I view a text representation of an lxml element?

If I'm parsing an XML document using lxml, is it possible to view a text representation of an element? I tried to do : print repr(node) but this outputs What can I use to see the node like it exists in the XML file? Is…
Geo
  • 93,257
  • 117
  • 344
  • 520
27
votes
2 answers

Py2exe lxml woes

I have a wxPython application that depends on lxml and works well when running it through the Python interpreter. However, when creating an exe with py2exe, I got this error: ImportError: No module named _elementpath. I then used python setup.py…
jack the lesser
  • 701
  • 2
  • 7
  • 15
26
votes
5 answers

How to find the number of elements in element tree in python?

I am new to ElementTree, and I am trying to find the number of elements in the element tree. from lxml import etree root = etree.parse(open("file.xml",'r')) Is there any way to find the total count of the elements in root?
mariz
  • 509
  • 1
  • 7
  • 13
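
One possible approach, sketched under assumptions: walk the tree with iter() and count what it yields. The inline sample XML stands in for the question's file.xml, and the fallback import is an addition so the snippet also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample; the question parses "file.xml" instead
root = etree.fromstring("<root><a><b/><b/></a><c/></root>")

# iter() walks the element itself and all of its descendants in document order
count = sum(1 for _ in root.iter())

# len() on an element counts only its direct children
direct_children = len(root)

print(count, direct_children)
```

The same iter() call is available on the tree object returned by etree.parse(), so the count can be taken without first calling getroot().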
26
votes
4 answers

Installing lxml, libxml2, libxslt on Windows 8.1

After additional exploration, I found a solution to installing lxml with pip and wheel. Additional comments on approach welcomed. I'm finding the existing Python documentation for Linux distributions excellent. For Windows... not so much. I've…
SigmaXD
  • 921
  • 1
  • 7
  • 11
26
votes
3 answers

How to parse broken HTML with LXML

I'm trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7. Unlike in the lxml documentation (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work: from lxml import etree import StringIO broken_html =…
diemacht
  • 2,022
  • 7
  • 30
  • 44
25
votes
5 answers

easy_install lxml on Python 2.7 on Windows

I'm using Python 2.7 on Windows. How come the following error occurs when I try to install lxml using setuptools's easy_install? C:\>easy_install lxml Searching for lxml Reading http://pypi.python.org/simple/lxml/ Reading…
Jonathan Livni
  • 101,334
  • 104
  • 266
  • 359
25
votes
5 answers

How to use regular expression in lxml xpath?

I'm using a construction like this: doc = parse(url).getroot() links = doc.xpath("//a[text()='some text']") But I need to select all links whose text begins with "some text", so I'm wondering whether there is any way to use a regexp here. Didn't find…
Arty
  • 5,923
  • 9
  • 39
  • 44
24
votes
4 answers

How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to manually set it. While iterating through the file though, there are still some…
damon
  • 14,485
  • 14
  • 56
  • 75