Questions tagged [lxml]

lxml is a full-featured, high-performance Python library for processing XML and HTML.

Questions that concern the lxml Python library should have this tag. Per the lxml website, "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt." The library's lxml.etree package is used for XML processing, and lxml.html handles HTML. lxml.html.soupparser uses BeautifulSoup to parse broken HTML, and lxml.html.html5parser uses html5lib to apply the HTML5 parsing algorithm.
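As a rough illustration of the lxml.etree package mentioned above, here is a minimal sketch that parses a small XML document and queries it with XPath. The catalog markup and its element names are invented for the example and do not come from any question on this page.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
xml = (
    b"<catalog>"
    b"<book id='1'><title>A</title></book>"
    b"<book id='2'><title>B</title></book>"
    b"</catalog>"
)

# Parse the bytes into an element tree and get the root element.
root = etree.fromstring(xml)

# XPath query: the text of every <title> under a <book>.
titles = root.xpath("//book/title/text()")

# ElementTree-style iteration: collect the id attribute of each <book>.
ids = [book.get("id") for book in root.iter("book")]

print(titles, ids)
```

The same tree can be traversed either with XPath or with the ElementTree-style methods (iter, find, findall); which one reads better depends on the query.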

Links:

https://lxml.de/ - Contains API documentation and tutorials

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ - IBM developerWorks page on lxml

5412 questions
2
votes
1 answer

lxml xpath can not handle

tag

How can I get the p tag text "Blahblah" in this situation? When the p tag's text comes after a strong tag, lxml does not recognize it.

ccBlahblah

====code==== from lxml import html content="""
babayetu
  • 77
  • 1
  • 5
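For readers hitting the same issue: in lxml's tree model, text that follows a child element is stored on that child's .tail attribute rather than on the parent's .text. The sketch below assumes markup along the lines the question describes (a strong element followed by text inside a p); the exact HTML is a guess, since the excerpt above is truncated.

```python
from lxml import html

# Assumed markup, reconstructed from the question's description only.
content = "<div><p><strong>cc</strong>Blahblah</p></div>"
root = html.fromstring(content)

p = root.xpath("//p")[0]
strong = p.find("strong")

# p.text only covers text before the first child element, so it is None here.
text_before_child = p.text

# The text after </strong> is stored as the tail of the strong element.
tail_text = strong.tail

# XPath text() returns the text nodes that are direct children of p.
direct_text_nodes = p.xpath("text()")

# text_content() concatenates all text inside p, including descendants.
all_text = p.text_content()

print(text_before_child, tail_text, direct_text_nodes, all_text)
```

Which accessor is appropriate depends on whether only the trailing text or the whole paragraph text is wanted.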
2
votes
0 answers

__init__() got an unexpected keyword argument 'convertEntities'

I'm getting the error in the title when trying to parse HTML with soupparser, the external interface to the BeautifulSoup HTML parser. This is my code: from lxml.html.soupparser import fromstring fromstring(""); Also, since I'm…
Tommz
  • 3,393
  • 7
  • 32
  • 44
2
votes
1 answer

Scraping IMDb Review Page with lxml and requests package

I want to extract the user reviews of a particular movie with the help of lxml. Before that, I need to find out the number of reviews. An example review page is Interstellar. I found the XPath where User Reviews are found with the help of Firebug:…
GokuShanth
  • 203
  • 3
  • 12
2
votes
2 answers

python lxml.html.parse not reading url

Why is html.parse(url) failing, when using requests then html.fromstring works and html.parse(url2) works? lxml 3.4.2 Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32 Type "copyright", "credits" or "license()"…
foosion
  • 7,619
  • 25
  • 65
  • 102
2
votes
1 answer

Scraping multiple urls in parallel and inserting lxml element in queue

I am parsing multiple pages at once using the lxml module with this piece of code: def read_and_parse_url(url, queue): """ Read and parse the url """ data = urllib2.urlopen(url).read() root = lxml.html.fromstring(data) …
Thiago
  • 652
  • 1
  • 9
  • 29
2
votes
1 answer

how to write the opening of an xml doc in lxml?

I'm using lxml to write out a cXML file, but I can't figure out how to get it to write out the opening along with the doctype following it. When I started this, I started straight in on the document itself,…
Bendustries
  • 71
  • 1
  • 3
  • 10
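One possible approach to the question above is to pass the xml_declaration and doctype arguments to etree.tostring when serializing. This is a sketch, not the accepted answer: the element names and the DTD URL are placeholders chosen for the example, and a real cXML document would use the DTD location from its own specification.

```python
from lxml import etree

# Build a tiny tree; "cXML" and "Header" are used here only as sample names.
root = etree.Element("cXML")
etree.SubElement(root, "Header")

# Placeholder DOCTYPE; the URL is hypothetical, not the real cXML DTD.
doctype = '<!DOCTYPE cXML SYSTEM "http://example.com/cXML.dtd">'

# Serialize with an XML declaration first, then the DOCTYPE, then the tree.
output = etree.tostring(
    root,
    xml_declaration=True,
    encoding="UTF-8",
    doctype=doctype,
    pretty_print=True,
)

print(output.decode("utf-8"))
```

Because an encoding is given, tostring returns bytes, so the result can be written directly to a file opened in binary mode.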
2
votes
1 answer

Help with parsing lxml

To implement a college project, I need to handle XML files. For this I chose lxml after doing some research. However, I can't seem to find a good tutorial to help me get started. I can't choose most specifically which type of parsing I need to…
user225312
  • 126,773
  • 69
  • 172
  • 181
2
votes
1 answer

lxml unicode entity parse problems

I'm using lxml as follows to parse an exported XML file from another system: xmldoc = open(filename) etree.parse(xmldoc) But I'm getting: lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46 Obviously it's having…
Jon Hadley
  • 5,196
  • 8
  • 41
  • 65
2
votes
1 answer

How do I require that an element has either one set of attributes or another in an XSD schema?

I'm working with an XML document where a tag must either have one set of attributes or another. For example, it needs to either look like or e.g.
Eli Courtwright
  • 186,300
  • 67
  • 213
  • 256
2
votes
0 answers

Combining tail and pretty_print in lxml

As soon as I modify the tail of an element (default is None), writing with pretty_print deletes all indentation. Everything is on a single line. Is combining pretty_print and tail not possible? Example: from lxml import etree as et root =…
Eric H.
  • 2,152
  • 4
  • 22
  • 34
2
votes
2 answers

Regular expression works normally, but fails when placed in an XML schema

I have a simple doc.xml file which contains a single root element with a Timestamp attribute: I'd like to validate this document against my simple schema.xsd to…
Eli Courtwright
  • 186,300
  • 67
  • 213
  • 256
2
votes
1 answer

Should Python 2.6 on OS X deal with multiple easy-install.pth files in $PYTHONPATH?

I am running ipython from sage and also am using some packages that aren't in sage (lxml, argparse) which are installed in my home directory. I have therefore ended up with a $PYTHONPATH of $HOME/sage/local/lib/python:$HOME/lib/python Python is…
ahd
  • 21
  • 2
2
votes
3 answers

Output of lxml in Python 2.7

This might be a completely foolish question, but Google has been of no avail. First, of course, importing the libraries I need: from lxml import html from lxml import etree import requests Simple enough. Now to run and parse some code. The link in this…
Ruhpun
  • 25
  • 3
2
votes
1 answer

How do I do thread-safe python XML validation?

Using Python 3.3, I need to validate XML documents against their DTDs or XSDs, and I expect to validate many documents against each specification. I will have a multi-threaded application performing the validation. lxml documentation explains how…
2
votes
3 answers

Python - Requests: Correctly Using Params?

Before I begin, may I just say, I am very new to general communication with the web in code. With that said, could anyone assist me in getting these parameters, 'a': stMonth, 'b': stDate, 'c': stYear, 'd': enMonth, …
The Novice
  • 124
  • 9