Questions tagged [lxml.html]

lxml.html is a dedicated Python package for dealing with HTML. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.

159 questions
1
vote
1 answer

How to extract paragraph text in Python using lxml from an HTML file?

I am trying to extract the paragraph but I get [] instead of the paragraph. How can I extract the paragraph? Selector_1 = "div.bloco-imovel-texto p" tree.cssselect(Selector_1)
sargupta
1
vote
0 answers

Cut XML tree at a specific depth

I have XML files like this one: This file…
dada
1
vote
1 answer

How to extract href from an element using lxml CSSSelector?

def extract_page_data(html): tree = lxml.html.fromstring(html) item_sel = CSSSelector('.my-item') text_sel = CSSSelector('.my-text-content') time_sel = CSSSelector('.time') author_sel = CSSSelector('.author-text') a_tag = CSSSelector('.a') for…
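
A minimal sketch of reading `href` values, not the asker's full function (which is truncated above). Whichever way an element is selected, its attributes are read with `.get()`. The sample markup and the `extract_hrefs` helper are hypothetical, and XPath class matching is swapped in for `CSSSelector` so the example does not depend on the separate cssselect package. Note that a selector such as `.a` matches elements with class `a`, not `<a>` tags, which may or may not be what the excerpt intends.

```python
from lxml import html

# Hypothetical markup; only the class name "my-item" comes from the excerpt.
SAMPLE = (
    '<div id="list">'
    '<div class="my-item"><a href="/posts/1">First</a></div>'
    '<div class="my-item"><a href="/posts/2">Second</a></div>'
    '</div>'
)

# XPath equivalent of the CSS class selector ".my-item".
ITEM_XPATH = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' my-item ')]"


def extract_hrefs(markup):
    """Return the href of every link found inside a .my-item element."""
    tree = html.fromstring(markup)
    hrefs = []
    for item in tree.xpath(ITEM_XPATH):
        for link in item.xpath(".//a[@href]"):
            # .get() reads an attribute and returns None if it is absent.
            hrefs.append(link.get("href"))
    return hrefs


print(extract_hrefs(SAMPLE))
```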
1
vote
3 answers

KeyError in python saying KeyError : 'value'

I am trying to get the hidden elements on the Twitter login page. I followed a procedure which simply gets the hidden elements on that page. But the problem is that when I try to get the value of those elements, I get a KeyError. The code is: import…
Akhil Reddy
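
The asker's code is truncated, so the cause cannot be confirmed. One common cause, assuming the lookup is `element.attrib['value']` on an input that has no `value` attribute, can be sketched as follows. The form markup, field names, and the `hidden_fields` helper are hypothetical.

```python
from lxml import html

# Hypothetical form; the second hidden input has no value attribute.
SAMPLE = (
    '<form>'
    '<input type="hidden" name="token" value="abc123">'
    '<input type="hidden" name="metrics">'
    '</form>'
)


def hidden_fields(markup):
    """Map each hidden input's name to its value, or "" if it has none."""
    form = html.fromstring(markup)
    fields = {}
    for node in form.xpath('.//input[@type="hidden"]'):
        # node.attrib["value"] would raise KeyError for the second input;
        # node.get() returns the supplied default instead.
        fields[node.get("name")] = node.get("value", "")
    return fields


print(hidden_fields(SAMPLE))
```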
1
vote
1 answer

Convert element to css selector in python

I'm trying to convert the following element, @[width="300"], which I convert to XPath as //*[@width="300"], to a CSS selector. Because with lxml if I run: selector = "@[width="300"]" tree =…
J0ker98
1
vote
1 answer

How to get data by selecting a value from a drop-down option without using Selenium

I need to fetch all URLs from this page - http://www.questdiagnostics.com/testcenter/BUSearch.action?submitValue=BUSearch&keyword=Toxoplasma+Abs+IgG+%2F+IgM whenever I select a value from a drop-down and click the Go button. I selected a value…
1
vote
3 answers

Unable to remove spaces between scraped text

I've written a script in Python to scrape some text out of some HTML elements. The script can parse it now. However, the problem is that the results look weird, with a bunch of spaces between them. How can I fix it? Any help will be highly appreciated. This…
SIM
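
A minimal sketch of one way to collapse stray whitespace in scraped text, assuming the extra spaces come from indentation and line breaks in the source markup. The sample markup and the `normalized_text` helper are hypothetical; the asker's actual page is not shown in the excerpt.

```python
from lxml import html

# Hypothetical markup with irregular spacing and line breaks.
SAMPLE = "<div><p>  Some   scraped\n\n   text  <b> here </b></p></div>"


def normalized_text(element):
    """Return the element's text with every whitespace run collapsed to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace, so joining restores single spaces.
    return " ".join(element.text_content().split())


paragraph = html.fromstring(SAMPLE).xpath(".//p")[0]
print(normalized_text(paragraph))
```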
1
vote
1 answer

How to get concatenated child text nodes in lxml

This is the HTML sample:

First text part

Andersson
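
The HTML sample did not survive in the excerpt, so the following is only a sketch under assumed markup. It contrasts two ways of concatenating text: joining the direct child text nodes with `./text()`, and `text_content()`, which also includes text nested in descendants. Only the string "First text part" comes from the excerpt; the rest of the markup and both helper names are hypothetical.

```python
from lxml import html

# Hypothetical markup; the original sample is not available.
SAMPLE = '<div>First text part<br>Second text part<span>nested</span></div>'


def direct_text(element):
    """Concatenate only the text nodes that are direct children of element."""
    return "".join(element.xpath("./text()"))


def all_text(element):
    """Concatenate all text in element, including text in descendants."""
    return element.text_content()


div = html.fromstring(SAMPLE)
print(direct_text(div))
print(all_text(div))
```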
1
vote
1 answer

can't get value inside tag in lxml

I am using lxml to scrape data from a website. The HTML code snippet is
1
vote
1 answer

Select and modify xpath nodes after specific text

I use this code to get all names: def parse_authors(self, root): author_nodes = root.xpath('//a[@class="booklink"][contains(@href,"/author/")]/text()') if author_nodes: return [unicode(author) for author in author_nodes] But I…
wrangly
1
vote
1 answer

How to check if an element exists in lxml XPath?

I use lxml XPath for parsing an HTML page in Python 3. As a sample, I have code that finds an HTML element: version_android = doc.xpath("//div[@itemprop='operatingSystems']//text()") Further, I have an insert MySQL query: insert = ("insert into tracks…
Huligan
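
A minimal sketch of an existence check, assuming the goal is a safe value to pass to the insert query. A node-selecting `xpath()` call returns a list, and an empty list is falsy, so the result can be tested before indexing. The sample markup, the `fileSize` property, and the `first_text_or_default` helper are hypothetical.

```python
from lxml import html

# Hypothetical page; only the operatingSystems expression comes from the excerpt.
SAMPLE = (
    "<html><body>"
    "<div itemprop='operatingSystems'> Android 4.1 </div>"
    "</body></html>"
)


def first_text_or_default(doc, expression, default=None):
    """Return the first stripped match of an XPath expression, or default."""
    nodes = doc.xpath(expression)
    if not nodes:  # empty list: nothing matched
        return default
    return nodes[0].strip()


doc = html.fromstring(SAMPLE)
version_android = first_text_or_default(
    doc, "//div[@itemprop='operatingSystems']//text()"
)
file_size = first_text_or_default(
    doc, "//div[@itemprop='fileSize']//text()", default="unknown"
)
print(version_android, file_size)
```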
1
vote
1 answer

lxml removes iframes whose src starts with a double slash

I'm using lxml to sanitize HTML data, but in some cases lxml also removes valid tags. It removes iframe tags that have a valid host but start with double slashes (//). Code example: >>> cleaner =…
user3164429
1
vote
1 answer

Why does lxml.html sometimes swallow/remove whitespace instead of preserving it?

Given the following code, one might reasonably expect almost the exact same string of HTML that was fed into lxml to be spit back out. from lxml import html HTML_TEST_STRING =…
naki
1
vote
1 answer

Proper XPath to roll up text of children

I'm parsing a page that has structure like this:
content a
content b
# returns content a content b And I'm using the following XPath to get the content: "//pre[@class='asdf']/text()" It works well,…
tedder42
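
The page structure was lost from the excerpt, so this is a sketch under assumed markup. It assumes the `pre` elements contain child elements, in which case `/text()` returns only the direct text nodes. Evaluating `string(.)` on each matched node rolls up all descendant text into one string per element. The sample markup and the `rolled_up_text` helper are hypothetical; only the `pre` tag and the `asdf` class come from the excerpt.

```python
from lxml import html

# Hypothetical markup: each <pre> mixes direct text with a child element.
SAMPLE = (
    '<div>'
    '<pre class="asdf">content <a href="#">a</a></pre>'
    '<pre class="asdf">content <b>b</b></pre>'
    '</div>'
)


def rolled_up_text(tree):
    """Return one string per matching <pre>, including text of its children."""
    return [
        pre.xpath("string(.)")  # string value: all descendant text, concatenated
        for pre in tree.xpath(".//pre[@class='asdf']")
    ]


tree = html.fromstring(SAMPLE)
print(rolled_up_text(tree))
```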
1
vote
1 answer

Python parsing HTML with lxml: get text of tag while specific sign causes problems

I'm parsing real-world HTML files with lxml. This means I want to extract information from tags, and I don't have control over the style. The problem I'm having lies within the data.
Notes
IssnKissn