Questions tagged [lxml]

lxml is a full-featured, high-performance Python library for processing XML and HTML.

Questions that concern the lxml Python library should have this tag. Per the lxml website, "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt." The library's lxml.etree package is used for XML processing. lxml can also use BeautifulSoup as a parser for broken HTML, and html5lib to parse with the HTML5 parsing algorithm.
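As a rough illustration of the XML processing mentioned above, here is a minimal sketch of parsing a string and reading elements back. The document and its element names are made up for the example. The import falls back to the standard library's ElementTree only so the snippet runs where lxml is not installed; the calls used here exist in both.

```python
try:
    from lxml import etree  # the library this tag covers
except ImportError:
    # Fallback so the sketch still runs without lxml installed;
    # the calls used below are available in both modules.
    import xml.etree.ElementTree as etree

# Hypothetical input document, not taken from any question on this page.
xml_text = "<catalog><book id='1'>Dive In</book><book id='2'>Go Deep</book></catalog>"

# Parse the string into an element tree and get its root element.
root = etree.fromstring(xml_text)

# findall() matches direct children named "book"; each element
# exposes its attributes through .get() and its text through .text.
titles = [(book.get("id"), book.text) for book in root.findall("book")]
print(titles)
```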

Links:

https://lxml.de/ - Contains API documentation and tutorials

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ - IBM developerWorks page on lxml

5412 questions
2
votes
3 answers

Unable to install Python Scrapy (Lxml) on Windows

I was trying to install Python Scrapy library but when it's trying to install Lxml library, this error appears: Requirement already up-to-date: pip in c:\python34\lib\site-packages Collecting lxml Using cached lxml-3.4.4.tar.gz Complete output…
Alejandra
  • 21
  • 3
2
votes
0 answers

ET.parse(filePath) is slower when run through different Python than one in Visual Studio

I run a Python script (Python 2.7) that is computationally heavy and does some XML input/output operations. When I tested its elements in Python Interactive in VS, they ran 2x faster (or more) than when I packed it into the main routine and ran it by…
jakub
  • 151
  • 2
  • 5
2
votes
1 answer

Lxml : Ampersand in text

I have a problem using lxml. I am using lxml to parse an XML file and write it back to a new XML file. Input file: " example text " " example text "
2
votes
1 answer

How do I get all content between two html tags in Python?

I try to extract all content (tags and text) from one main tag on html page. For example: `my_html_page = '''
Some text
Dim
  • 93
  • 1
  • 12
2
votes
3 answers

Using "if" as argument identifier

I want to generate the following xml file: I've tried this: from lxml import etree etree.Element("foo", if="bar") But I got this error: page = etree.Element("configuration", if="ok") …
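One possible way around the error in the question above, sketched under the assumption that the goal is simply an attribute named "if": since `if` is a reserved word in Python, the keyword-argument form is rejected by the parser before lxml is ever called. Passing the attribute name as a string avoids that. The same idea should apply to other reserved words such as `class`. The fallback import is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Option 1: pass the attributes as a dict, so the name is just a string.
by_dict = etree.Element("foo", {"if": "bar"})

# Option 2: unpack a dict as keyword arguments; the reserved word
# never appears as a bare identifier in the source code.
by_unpack = etree.Element("foo", **{"if": "bar"})

# Option 3: create the element first, then set the attribute.
by_set = etree.Element("foo")
by_set.set("if", "bar")

print(by_dict.get("if"), by_unpack.get("if"), by_set.get("if"))
```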
2
votes
2 answers

lxml.etree iterparse() and parsing element completely

I have an XML file with nodes that looks like this: 41.3681107 3.9598 I am using lxml.etree.iterparse() to iteratively parse…
Andreas
  • 83
  • 2
  • 8
2
votes
1 answer

"lxml.etree.XPathEvalError: Invalid expression" with Unicode element names

lxml nicely supports Unicode element names, as they are valid according to XML specification. But using Unicode in XPath produces an error: >>> import lxml.etree >>> e = lxml.etree.fromstring('
alexanderlukanin13
  • 4,577
  • 26
  • 29
2
votes
1 answer

Stop pyquery inserting spaces where there aren't any in source HTML?

I am trying to get some text from an element, using pyquery 1.2. There are no spaces in the displayed text, but pyquery is inserting spaces. Here is my code: from pyquery import PyQuery as pq html = '

Richard
  • 62,943
  • 126
  • 334
  • 542

2
votes
1 answer

How to fix lxml assertion error

I have an Ubuntu machine running Python 2.7.6. When I try using lxml, which has been installed using pip, I get the following error: Traceback (most recent call last): File "./export.py", line 44, in fetch_item root.append(elem) File…
David542
  • 104,438
  • 178
  • 489
  • 842
2
votes
2 answers

lxml doesn't get all text in element if text has <br>?

I am using lxml to parse a web document. I want to get all the text in a <p> element, so I use the following code: from lxml import etree page = etree.HTML(" test1 test2 ") print page.xpath("//p")[0].text # this just print…
roger
  • 9,063
  • 20
  • 72
  • 119
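A hedged sketch of why `.text` can come up short in a case like the one above. The markup in the excerpt was lost, so the input here (`<p>test1<br/>test2</p>`) is an assumed stand-in, parsed as XML rather than with `etree.HTML`. In the ElementTree data model that lxml follows, `.text` holds only the text before the first child element; text after a child is stored on that child's `.tail`. `itertext()` walks both.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Assumed stand-in for the question's markup (the original was lost).
p = etree.fromstring("<p>test1<br/>test2</p>")

# .text stops at the first child element.
first_only = p.text

# The remaining text hangs off the child element as its tail.
br = p[0]
after_br = br.tail

# itertext() yields every text and tail inside the element, in document order.
all_text = "".join(p.itertext())

print(first_only, after_br, all_text)
```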
2
votes
3 answers

How to get textarea value with lxml python

With this Python code I can get the whole HTML source: import mechanize import lxml.html import StringIO br = mechanize.Browser() br.set_handle_robots(False) br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13)…
Dark Cyber
  • 2,181
  • 7
  • 44
  • 68
2
votes
2 answers

Python 3.4.0 -- 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128) -- Unix 14.04

Trying to retrieve some data from the web using urllib and lxml, I got an error and have no idea how to fix it. url='http://sum.in.ua/?swrd=автор' page = urllib.request.urlopen(url) The error itself: UnicodeEncodeError: 'ascii' codec can't…
Khrystyna
  • 123
  • 2
  • 9
2
votes
3 answers

How to select text without the HTML markup

I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:

This class has some text and a few

Yuka
  • 473
  • 2
  • 10
2
votes
1 answer

Using lxml to Validate HTML

I am trying to use lxml to validate a piece of HTML but it complains that the fragment is invalid even though it should be valid: img = """""" parser =…
Alex Rothberg
  • 10,243
  • 13
  • 60
  • 120
2
votes
1 answer

Creating an element with 'class' attribute throws a syntax error

When I try to do this with the lxml module: div = etree.SubElement(body, "div", class="hmi") I get a: user@localhost:metk $ sudo python mbscan.py -r 192.168.0.0/24 --hmi File "mbscan.py", line 481 div = etree.SubElement(body, "div",…
Juicy
  • 11,840
  • 35
  • 123
  • 212