Questions tagged [lxml]

lxml is a full-featured, high performance Python library for processing XML and HTML.

Questions that concern the lxml Python library should have this tag. Per the lxml website, "The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt." The library's lxml.etree module is used for XML processing. For broken HTML, lxml.html.soupparser parses with BeautifulSoup, and lxml.html.html5parser uses html5lib to apply the HTML5 parsing algorithm.
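As a rough illustration of the lxml.etree usage mentioned above, here is a minimal sketch that parses an XML string and reads two child elements. The sample document, its tag names (`person`, `name`, `email`), and the helper `extract_contact` are all hypothetical and invented for this example. The fallback to the standard library's ElementTree is only there so the snippet still runs where lxml is not installed, and the sketch sticks to the API the two share.

```python
# Prefer lxml; fall back to the standard library's ElementTree, which offers
# a compatible API for the calls used here (assumption: either is acceptable).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample document, invented for illustration only.
XML_SAMPLE = "<person><name>Foo Bar</name><email>foo@bar.com</email></person>"


def extract_contact(xml_text):
    """Parse an XML string and return (name, email) as plain strings."""
    root = etree.fromstring(xml_text)
    # findtext() returns the text of the first matching child, or None.
    return root.findtext("name"), root.findtext("email")


name, email = extract_contact(XML_SAMPLE)
print(name, email)
```

Note that `find` and `findtext` take ElementPath expressions relative to the element they are called on, so a path that names the root element itself will not match.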

Links:

https://lxml.de/ - Contains API documentation and tutorials

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ - IBM developerWorks page on lxml

5412 questions
2 votes, 2 answers

Why etree.find doesn't find the element for the provided example

Let's suppose I have this: xml_as_str = ''' Foo Bar foo@bar.com ''' from lxml import etree tree = etree.fromstring(xml_as_str, etree.XMLParser(recover=True)) How could it…
trinchet
  • 6,753
  • 4
  • 37
  • 60
2 votes, 1 answer

Extracting all cities in Wikipedia

http://en.wikipedia.org/wiki/List_of_cities_in_China I want to extract all city names as shown below: I use the following code (to extract only one field), where the XPath is copied from Chrome: from lxml import html import requests page =…
william007
  • 17,375
  • 25
  • 118
  • 194
2 votes, 3 answers

Add / update elements at position using lxml python

I have a situation where I want to add a particular element at a given position, or update it if an element is already present at that position. Ex: ?
Vimalraj Selvam
  • 2,155
  • 3
  • 23
  • 52
2 votes, 2 answers

Scraping paginated sites and appending output in Python

I have a simple scraping task that I would like to improve the pagination efficiency of, and append lists so that I may output the results of scraping to a common/single file. The current task is scraping municipal laws for the city of São Paulo,…
DV Hughes
  • 305
  • 2
  • 5
  • 22
2 votes, 3 answers

Finding inline style with lxml.cssselector

New to this library (no more familiar with BeautifulSoup either, sadly), trying to do something very simple (search by inline style): blah blah I just want to select all tds where style="padding: 20px", but I can't…
ropa
  • 23
  • 3
2 votes, 1 answer

Get text next to selected element in lxml / Python

I have the following HTML markup and I'd like to get the English description as plain text out of this snippet, without the "English:" label and without any tags: from lxml import etree html = '''

English:

Simon Steinberger
  • 6,605
  • 5
  • 55
  • 97
2 votes, 1 answer

gcc Internal error on lxml installation CentOS

I am having some trouble installing lxml on CentOS-6. I have tried the solutions from some similar questions, such as "pip install lxml error" and "Setup.py: install lxml with Python2.6 on CentOS", but these did not work. How can I install it correctly? after…
salmanwahed
  • 9,450
  • 7
  • 32
  • 55
2 votes, 1 answer

How to modify XML as text in lxml

I have an XML file generated by an IDE; however, it unfortunately outputs code with newlines as BRs and seems to randomly decide where to place newlines. Example: if test = true foo; bar; endif becomes the following XTML within an XML…
user1601333
  • 151
  • 1
  • 10
2 votes, 1 answer

How can I get the text with XPath between and ?

I have some HTML code and I want to extract the string that starts with "Pour all ingredients" using XPath. I have already done this for the span and li elements, but this text does not belong to any element. How should I write the XPath? E.g. for li: for…
alex
  • 31
  • 7
2 votes, 2 answers

Extracting the value by xpath in python between tags

I want to extract the parameter that I marked in the picture below... What I have tried is: url='http://site.ir' content=requests.get(url).content tree = html.fromstring(content) print [e.text_content() for e in…
MLSC
  • 5,872
  • 8
  • 55
  • 89
2 votes, 1 answer

lxml etree and xpath returning an encoded image rather than URL for src

I want the src URL of an image when I process some HTML, but I am getting back an encoded image. What am I doing wrong if I want the URL? Given a URL like: "http://www.amazon.com/Cheese-Plate-multi-purpose-mounting-plate/dp/B00CI06DWE/" And a…
dolphinkickme
  • 73
  • 1
  • 8
2 votes, 2 answers

Get attributes and text from Xpath query as a list

I would like to query an HTML string and extract the href attribute and the text node from a hyperlink into a list (or a dictionary). Consider the following code: from lxml import html str = ' Text1 ' \ '
madflow
  • 7,718
  • 3
  • 39
  • 54
2 votes, 2 answers

integration of python into excel using pyxll... having problems with lxml module

I am new to Python. I am trying to get the meaning of a word from the internet. The standalone Python code works just fine. from lxml import html import requests url = "http://dictionnaire.reverso.net/francais-definition/" word =…
2 votes, 4 answers

How to convert XPath Element to plain html text?

I have a page: And I want to get the element '//div/a' as plain HTML text. text_url How can I do it?
Anton Barycheuski
  • 712
  • 2
  • 9
  • 21
2 votes, 2 answers

Python - Parse HTML class

I have tried in anger to parse the following representative HTML extract, using BeautifulSoup and lxml: [

Abacus Trust Company Limited
Sixty Circular Road
DOUGLAS
ISLE…

Chris Finlayson
  • 341
  • 4
  • 13