Questions tagged [lxml]

lxml is a full-featured, high-performance Python library for processing XML and HTML.

Questions that concern the lxml Python library should have this tag. Per the lxml website, "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt." The library's lxml.etree module is used for XML processing. lxml can also use BeautifulSoup to parse broken HTML, and html5lib to parse HTML with the HTML5 parsing algorithm.
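
A minimal sketch of the kind of lxml.etree usage the description refers to. The sample document and variable names are made up for illustration, and the fallback import of the standard library's ElementTree is an assumption added only so the snippet still runs where lxml is not installed:

```python
try:
    from lxml import etree  # the library this tag is about
except ImportError:
    # Assumption: fall back to the stdlib module, which offers a compatible subset.
    import xml.etree.ElementTree as etree

# Hypothetical sample document, not taken from any question on this page.
xml_text = "<catalog><book id='b1'><title>Example</title></book></catalog>"

root = etree.fromstring(xml_text)

# Walk every <book> element and read the text of its <title> child.
titles = [book.findtext("title") for book in root.iter("book")]

# Serialize the tree back to bytes.
serialized = etree.tostring(root)
```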

Links:

https://lxml.de/ - Contains API documentation and tutorials

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ - IBM developerWorks page on lxml

5412 questions
2
votes
2 answers

CxFreeze is not recognizing certain imports

After building my executable with cx_Freeze and trying to run the .exe, I get this error. I understand this means cx_Freeze is not recognizing lxml. However, I have tried to include this in my setup.py. Traceback (most recent call last): File…
sudobangbang
  • 1,406
  • 10
  • 32
  • 55
2
votes
1 answer

Getting XML attribute value with lxml module

How can I get the value of an attribute in an XML file with the lxml module? My XML looks like this: somename 0.456 0.4 …
Pythonizer
  • 1,080
  • 4
  • 15
  • 25
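
One possible way to read an attribute value, sketched under assumptions: the question's actual XML is not shown in the excerpt, so the document, element names, and attribute names below are hypothetical. The stdlib fallback import is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # assumption: compatible fallback

# Hypothetical XML; the real document from the question is not available.
xml_text = '<items><item name="somename" weight="0.456"/></items>'

root = etree.fromstring(xml_text)
item = root.find("item")

name = item.get("name")                # attribute value as a string
missing = item.get("colour", "n/a")    # default returned when the attribute is absent
weight = float(item.attrib["weight"])  # dict-style access; raises KeyError if absent
```

Using get() is usually the safer choice when an attribute may be missing, since it returns None (or the given default) instead of raising.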
2
votes
3 answers

Best way to get back to using the power of lxml after having to use a regex to find something in an html document

I am trying to rip some text out of a large number of html documents (numbers in the hundreds of thousands). The documents are really forms but they are prepared by a very large group of different organizations so there is significant variation in…
PyNEwbie
  • 4,882
  • 4
  • 38
  • 86
2
votes
2 answers

How to replace an HTML tag with text inside an lxml iterwalk loop

I'm iterating through an HTML tree with lxml iterwalk and I'd like to replace all
tags inside
 with new line characters. That's what I have so far:
root = lxml.html.fromstring(text)
for action, el in etree.iterwalk(root):
    if…
Simon Steinberger
  • 6,605
  • 5
  • 55
  • 97
2
votes
2 answers

Get XPath of an element in DOM tree?

I'm using the lxml implementation in Python for HTML and XML parsing. Setting up a parser like parser = lxml.etree.HTMLParser() and returning a tree from HTML source (string) tree = lxml.etree.fromstring(html, parser).getroottree() # Returns a XML…
2
votes
1 answer

lxml: Append 'None' or Null value when html tag text content is None

Trying to read HTML content and extract the last table's content into an array using lxml. Here is my last table: …
Nijin Narayanan
  • 2,269
  • 2
  • 27
  • 46
2
votes
1 answer

How do I scrape an https page?

I'm using a python script with 'lxml' and 'requests' to scrape a web page. My goal is to grab an element from a page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access the stuff in the page. I'm sure…
kevingduck
  • 531
  • 1
  • 6
  • 21
2
votes
0 answers

XML indentation set to 4 spaces

I'm using the following code to indent, as mentioned here:
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(filename, parser)
However, the original XML file is indented with 4 spaces and after using the code above it indents to 2…
bulkmoustache
  • 1,875
  • 3
  • 20
  • 24
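
A hedged sketch of one way to get 4-space indentation, assuming a recent version is available: etree.indent() exists in lxml 4.5 and later, and in the standard library's ElementTree from Python 3.9. The sample document is hypothetical, and this is not necessarily what the question's answers recommend.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # assumption: compatible fallback

# Hypothetical, unindented input.
root = etree.fromstring("<a><b><c/></b></a>")

# Re-indent the tree in place, using four spaces per nesting level.
etree.indent(root, space="    ")

text = etree.tostring(root, encoding="unicode")
lines = text.splitlines()
```

When parsing an already indented file, the remove_blank_text=True parser option from the question is still useful, since it drops the old whitespace-only text nodes before the tree is re-indented.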
2
votes
1 answer

Printing out messages from a lxml error log in UTF-8 format

I'm learning Python (version 2.7) and I have a task to validate an XML document against an XSD schema using the lxml library (http://lxml.de/). I have two files, like these examples: $ cat 1.xml
dmgl
  • 267
  • 5
  • 12
2
votes
2 answers

Crawling tables from webpage

I'm trying to extract CSU employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried using the urllib2 and requests libraries, but neither of them returned…
jinlong
  • 839
  • 1
  • 9
  • 19
2
votes
1 answer

lxml, add SubElement to SubElement

I've created an XML that looks like this. false
ErikSorensen
  • 215
  • 1
  • 16
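
A minimal sketch of nesting one SubElement inside another. The element names are hypothetical because the question's XML did not survive in the excerpt; only the text value "false" is borrowed from it.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # assumption: compatible fallback

# Hypothetical element names.
root = etree.Element("root")
parent = etree.SubElement(root, "parent")   # child of root
child = etree.SubElement(parent, "child")   # child of the previous SubElement
child.text = "false"

xml_bytes = etree.tostring(root)
```

SubElement returns the newly created element, so keeping a reference to it is all that is needed to attach further children underneath.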
2
votes
2 answers

Issue with parsing list of HTML with lxml and requests

I have a list of URLs stored in a variable href. When I pass it through the below function, the only returned recipe_links come from the first URL in href. Are there any glaring errors with my code? I'm not sure why it wouldn't loop through all 20…
metersk
  • 11,803
  • 21
  • 63
  • 100
2
votes
2 answers

Unicode: Python / lxml file output not as expected (print vs write)

I'm parsing an xml file using the code below:
import lxml
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = lxml.etree.XMLParser()
tree =…
Nick
  • 141
  • 11
2
votes
2 answers

Removing all children tags past a specific depth

Take some rudimentary HTML like this as an example. How could one truncate the tree by removing all child nodes nested more than, say, 2 levels deep?
ATMA
  • 129
  • 2
  • 6
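
One way to approach depth-limited pruning, sketched as a recursive helper. The function name prune, its parameters, and the sample markup are all hypothetical; the question's own HTML is not shown in the excerpt.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # assumption: compatible fallback


def prune(element, max_depth, depth=0):
    """Remove every descendant nested more than max_depth levels below element."""
    for child in list(element):  # iterate over a copy, since we mutate the parent
        if depth + 1 > max_depth:
            element.remove(child)
        else:
            prune(child, max_depth, depth + 1)


# Hypothetical sample: <b> sits three levels below the root <div>.
root = etree.fromstring("<div><p><span><b>deep</b></span></p></div>")
prune(root, max_depth=2)

remaining = [el.tag for el in root.iter()]
```

Note that in lxml, remove() also discards the removed element's tail text, which may or may not be wanted when truncating HTML.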
2
votes
3 answers

python xml xpath query using tag and attribute with ns

I must be doing something inherently wrong here; every example I've seen and searched for on SO seems to suggest this would work. I'm trying to use an XPath search with the lxml etree library to parse a Garmin TCX file:
kikixx
  • 161
  • 8
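
A sketch of a namespace-aware search by tag and attribute. The question's actual query and file are not shown, so the sample document, the "tcx" prefix, and the element and attribute names are assumptions; the namespace URI is the one commonly used by Garmin TCX files.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # assumption: compatible fallback

# Assumed prefix-to-URI mapping; the prefix name is arbitrary.
TCX_NS = {"tcx": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"}

# Hypothetical, heavily trimmed TCX-like document using a default namespace.
xml_text = (
    "<TrainingCenterDatabase "
    'xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">'
    "<Activities>"
    '<Activity Sport="Running"><Id>run-1</Id></Activity>'
    '<Activity Sport="Biking"><Id>ride-1</Id></Activity>'
    "</Activities>"
    "</TrainingCenterDatabase>"
)

root = etree.fromstring(xml_text)

# Every element in the path needs the prefix, because the document's
# default namespace applies to all of its elements.
running = root.findall(".//tcx:Activity[@Sport='Running']", namespaces=TCX_NS)
ids = [act.findtext("tcx:Id", namespaces=TCX_NS) for act in running]
```

With lxml specifically, root.xpath() accepts the same kind of prefix mapping through its namespaces argument. A frequent cause of empty results is leaving the prefix off elements that live in a default namespace.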