Questions tagged [lxml]

lxml is a full-featured, high-performance Python library for processing XML and HTML.

Questions that concern the lxml Python library should have this tag. Per the lxml website, "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt." The library's lxml.etree package is used for XML processing, and lxml.html for HTML. For broken HTML, lxml can delegate to BeautifulSoup (through lxml.html.soupparser); for the HTML5 parsing algorithm, it can use html5lib (through lxml.html.html5parser).
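As a rough illustration of the lxml.etree usage described above, here is a minimal sketch. The catalog document and the `list_titles` helper are made up for this example, and the `ImportError` fallback to the standard library's ElementTree (which shares this subset of the API) is only there so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the libxml2-backed implementation this tag covers
except ImportError:
    # Assumption: fall back to the stdlib module, which offers the same calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical sample document, not taken from any question on this page.
XML = (
    b"<catalog>"
    b"<book id='b1'><title>First</title></book>"
    b"<book id='b2'><title>Second</title></book>"
    b"</catalog>"
)


def list_titles(xml_bytes):
    """Return (id, title) pairs for every <book> directly under the root."""
    root = etree.fromstring(xml_bytes)
    return [(book.get("id"), book.findtext("title")) for book in root.findall("book")]


titles = list_titles(XML)
print(titles)
```

With lxml installed, the same `root` object also supports `root.xpath(...)`, which the standard library does not provide.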

Links:

https://lxml.de/ - Contains API documentation and tutorials

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ - IBM developerWorks page on lxml

5412 questions
21
votes
2 answers

Beautiful Soup and Table Scraping - lxml vs html parser

I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup. ...
I would like to know why the code below works with the "html.parser" and prints back none if I change…
LaGuille
  • 1,658
  • 5
  • 20
  • 37
21
votes
1 answer

HTML scraping using lxml and requests gives a unicode error

I'm trying to use an HTML scraper like the one provided here. It works fine for the example they provided. However, when I try using it with my webpage, I receive this error - Unicode strings with encoding declaration are not supported. Please use…
user3783999
  • 571
  • 2
  • 7
  • 17
21
votes
1 answer

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code: fragment = '
This is a text node.
This is another text node.

And a child element.Another child,
with two…
extempo
  • 213
  • 2
  • 6
20
votes
7 answers

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

I am trying to use html5lib to parse an html page into something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a…
Dan.StackOverflow
  • 1,279
  • 4
  • 18
  • 28
20
votes
3 answers

Parsing broken XML with lxml.etree.iterparse

I'm trying to parse a huge xml file with lxml in a memory efficient manner (i.e., streaming lazily from disk instead of loading the whole file in memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The…
erikcw
  • 10,787
  • 15
  • 58
  • 75
19
votes
3 answers

Creating a doctype with lxml's etree

I want to add doctypes to my XML documents that I'm generating with LXML's etree. However I cannot figure out how to add a doctype. Hardcoding and concatenating the string is not an option. I was expecting something along the lines of how PI's are…
Marijn
  • 288
  • 1
  • 2
  • 6
19
votes
3 answers

using lxml and iterparse() to parse a big (+- 1Gb) XML file

I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content": MM/DD/YY Last Name, Name Lorem ipsum…
mvime
  • 327
  • 1
  • 2
  • 8
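
For the question above, one common streaming approach can be sketched as follows. Only the "Author" and "Content" tag names come from the excerpt; the `<Posts>`/`<Post>` wrapper elements, the sample data, and the `extract` helper are hypothetical, and a small in-memory buffer stands in for the 1Gb file. The stdlib fallback is an assumption so the snippet runs without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib iterparse behaves the same for this simple case.
    import xml.etree.ElementTree as etree

# Hypothetical structure; the real file's layout is not shown in the excerpt.
SAMPLE = (
    b"<Posts>"
    b"<Post><Author>Last Name, Name</Author><Content>Lorem ipsum</Content></Post>"
    b"<Post><Author>Other, Person</Author><Content>Dolor sit</Content></Post>"
    b"</Posts>"
)


def extract(source):
    """Stream through the document and collect (author, content) pairs."""
    results = []
    # "end" events fire once an element and all of its children are fully parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "Post":
            results.append((elem.findtext("Author"), elem.findtext("Content")))
            elem.clear()  # drop the children and text we no longer need
    return results


records = extract(io.BytesIO(SAMPLE))
print(records)
```

Note that `elem.clear()` empties each record but the emptied elements stay attached to the root, so for very large inputs it may also be worth removing them from their parent.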
19
votes
2 answers

How to find XML Elements via XPath in Python in a namespace-agnostic way?

Since I had this annoying issue for the 2nd time, I thought that asking would help. Sometimes I have to get Elements from XML documents, but the ways to do this are awkward. I’d like to know a python library that does what I want, an elegant way to…
flying sheep
  • 8,475
  • 5
  • 56
  • 73
19
votes
8 answers

lxml.etree, element.text doesn't return the entire text from an element

I scraped some html via xpath, that I then converted into an etree. Something similar to this: text1 link text2 but when I call element.text, I only get text1 (It must be there, when I check my query in FireBug, the text of…
user522034
  • 221
  • 1
  • 3
  • 5
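
The behaviour described in the question above follows from the ElementTree data model that lxml uses: `.text` holds only the text before the first child, and text that follows a child is stored on that child's `.tail`. A minimal sketch, assuming markup shaped like the excerpt's "text1 link text2" (the `<p>` and `<a>` element names are illustrative, since the original markup is not shown):

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module models text/tail the same way.
    import xml.etree.ElementTree as etree

# Hypothetical markup matching the shape "text1 link text2".
element = etree.fromstring("<p>text1 <a>link</a> text2</p>")

first_text = element.text                # only the text before the first child
tail_text = element[0].tail              # text following the <a> child
full_text = "".join(element.itertext())  # all text nodes, in document order

print(repr(first_text), repr(tail_text), repr(full_text))
```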
19
votes
2 answers

Python lxml Subelement with text value?

Is it possible to somehow create element with default text value? So I would not need to do it like this? from lxml import etree root = etree.Element('root') a = etree.SubElement(root, 'a') a.text = 'some text' # Avoid this extra step? I mean you…
Andrius
  • 19,658
  • 37
  • 143
  • 243
19
votes
2 answers

How to add a namespace to an attribute in lxml

I'm trying to create an xml entry that looks like this using python and lxml: I'm using python and lxml. I'm having trouble with the adlcp:scormtype attribute. I'm new to xml so please correct…
Mateo
  • 1,781
  • 1
  • 16
  • 21
19
votes
1 answer

Parse SGML with Open Arbitrary Tags in Python 3

I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml I am using Python 3 and have been unable to find a solution with existing libraries to parse an SGML file with open…
borncamp
  • 300
  • 2
  • 5
18
votes
2 answers

How to write namespaced element attributes with LXML?

I'm using lxml (2.2.8) to create and write out some XML (specifically XGMML). The app which will be reading it is apparently fairly fussy and wants to see a top level element with:
timday
  • 24,582
  • 12
  • 83
  • 135
18
votes
1 answer

Extracting lxml xpath for html table

I have a html doc similar to following:
mkt2012
  • 251
  • 1
  • 3
  • 8
18
votes
1 answer

Pylint Error Message: "E1101: Module 'lxml.etree' has no 'strip_tags' member'"

I am experimenting with lxml and python for the first time for a personal project, and I am attempting to strip tags from a bit of source code using etree.strip_tags(). For some reason, I keep getting the error message: "E1101: Module 'lxml.etree'…
Aaron Viscichini
  • 257
  • 3
  • 13
CodeName