Questions tagged [lxml.html]

lxml.html is a dedicated Python package for dealing with HTML. It is based on lxml's HTML parser but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
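
A minimal sketch of the HTML-specific Element API mentioned in the tag description. The markup and variable names are made up for illustration; `text_content()` and `iterlinks()` are two of the helpers that lxml.html adds on top of plain lxml elements.

```python
from lxml import html

# Hypothetical markup, used only for illustration.
doc = html.fromstring(
    '<div><p class="intro">Hello <a href="/docs">docs</a></p></div>'
)

# XPath works the same way as in the rest of lxml.
paragraph = doc.xpath('//p[@class="intro"]')[0]

# text_content() returns the text of the element and all its descendants.
text = paragraph.text_content()

# iterlinks() yields (element, attribute, link, pos) for each link it finds.
links = [link for _element, _attribute, link, _pos in doc.iterlinks()]
```

Here `text` holds the paragraph text with the markup stripped, and `links` collects every URL referenced in the snippet.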

159 questions
2
votes
0 answers

Python, robobrowser, answer authentication-challenge after login

I'm really new to Python programming. I'm working on automating a web browser. I started with Selenium, but found it too slow for what I need. I'm working on code that can log in to a webpage, fill out a few text boxes, and click on…
shiny
  • 171
  • 4
  • 13
2
votes
1 answer

Python: lxml xpath to extract content

The code below extracts the PE from the Reuters link below. However, my method is not robust: the webpage for another stock has two fewer lines, which shifts the data. How can I overcome this problem? I would like to point straight to the…
vindex
  • 331
  • 6
  • 17
2
votes
1 answer

Python lxml iterating through tr elements

I'm running into an issue when trying to get the parent node of a tr element whilst iterating through them all. Here's a basic table that I'm working with.

Some text

Chad
  • 35
  • 1
  • 8
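
For the kind of iteration described in the question above, a rough sketch using `getparent()` could look like this. The table is hypothetical, and the example assumes the usual lxml behaviour of not inserting a `<tbody>` the way browsers do.

```python
from lxml import html

# Hypothetical table, used only for illustration.
table = html.fromstring(
    '<table id="people">'
    '<tr><td>row one</td></tr>'
    '<tr><td>row two</td></tr>'
    '</table>'
)

parent_tags = []
for row in table.iter('tr'):
    # getparent() returns the enclosing element, or None for the root.
    parent = row.getparent()
    parent_tags.append(parent.tag)
```

If the source markup contains an explicit `<tbody>`, that element is the parent instead, so code that needs the table itself may have to walk up one more level.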
2
votes
2 answers

How to grab raw all raw html within a certain XPath from a local file in Python

I am trying to grab the raw html from a bunch of local html files. I had some help from this post in getting the raw file to read in: "Get all text inside a tag lxml". But the code I have currently produces the entire file instead of a subset. Right…
Paul Loach
  • 25
  • 4
2
votes
2 answers

Attempting to get the text from a certain part of a website using lxml.html

I have some current Python code that is supposed to get the HTML from a certain part of a website, using the xpath of where the HTML tag is located. def wordorigins(word): pageopen =…
2
votes
1 answer

HTML parsing with lxml, python, .tail being broken up by <br> tags

I have a website that I am trying to scrape (while not really understanding html) but I have done a ton of reading and made some progress. It's a messy site but the important part looks like this:

DESCRIPTOR1: " important…
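
A sketch of how text split by `<br>` tags can be reassembled. The markup below is an assumption, since the excerpt above is truncated: it only illustrates that the text after a label lands in the label's `.tail`, and each later piece lands in the `.tail` of the preceding `<br>`.

```python
from lxml import html

# Hypothetical markup; the real page structure is not shown in the excerpt.
block = html.fromstring(
    '<div><b>LABEL:</b> first part<br>second part<br>third part</div>'
)

label = block.find('b')

# Text following an element is stored in that element's .tail.
pieces = [label.tail] + [br.tail for br in block.findall('br')]

# Join the non-empty pieces into a single value.
value = ' '.join(piece.strip() for piece in pieces if piece)
```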

2
votes
1 answer

HTML parsing with lxml - how to keep empty content in resulting list?

I am using lxml to parse an html file: from lxml import html tree = html.parse(myfile) data = tree.xpath('//p/text()') I have 300 <p>text</p> tags in my html file, but len(data) is only 250 because sometimes I'll have <p></p> in my html. I want…
user1566200
  • 1,826
  • 4
  • 27
  • 47
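
A small sketch of one way to keep empty paragraphs in the result: select the `<p>` elements themselves instead of their text nodes, then read `.text` from each. The markup is invented for illustration.

```python
from lxml import html

# Hypothetical markup with one empty paragraph.
tree = html.fromstring('<div><p>first</p><p></p><p>third</p></div>')

# Selecting text nodes skips paragraphs that have no text at all.
text_nodes = tree.xpath('//p/text()')

# Selecting the elements keeps one entry per paragraph; .text is None
# for an empty element, so fall back to an empty string.
data = [p.text or '' for p in tree.xpath('//p')]
```

With this approach the length of `data` matches the number of `<p>` elements, at the cost of only capturing the text before any child element.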
2
votes
2 answers

This xPath is giving no results, any reason why?

import requests from lxml import html page = requests.get(url="http://www.cia.gov/library/publications/the-world-factbook/geos/ch.html") tree = html.fromstring(page.content) bordering =…
2
votes
2 answers

Python LXML.HMTL Xpath Return Empty List

Problem: The date_list is an empty list. Should not be empty because list length should equal list length of oct and filing_type_list. What I have done: searched for typos. tried different companies (example is of REXAHN PHARMACEUTICALS,…
SAH
  • 269
  • 3
  • 8
2
votes
0 answers

Python lxml xpath gives different results on two different unix distros

When I run this xpath expression //tr[42]/td//span/./following-sibling::a[1]/@href on two different systems, I get two different results. On Ubuntu 14.04.2 LTS I get ["javascript:__doPostBack('datagrid_results$_ctl44$_ctl1','')"] On rehel fedora…
Fuchida
  • 428
  • 6
  • 16
2
votes
1 answer

Web scraping a text() in python

I am having trouble with a web scraping function. The XPath for the two things I am trying to get are /html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/text() /html/body/div/table[2]/tbody/tr[5]/td[1]/div[1]/ul/li[1]/a The html is
  • lost
    • 377
    • 2
    • 14
  • 2
    votes
    1 answer

    How to parse a htmlpage with lxml with <br> screwing up?

    I want to parse the following piece of html from Nasa's website with lxml in python:

    Launch Date:1981-09-24
    Launch Vehicle: Delta
    Launch Site: Cape…

    Frank
    • 99
    • 1
    • 6
    2
    votes
    1 answer

    lxml.html ignoring body class attributes

    I am using lxml.html for parsing html content, but I don't understand why lxml is dropping "body" tag attributes. I tried using both lxml.html.parse and lxml.html.document_fromstring, as suggested here, but it is still not working. Example html…
    Karan
    • 46
    • 3
    2
    votes
    3 answers

    How to get textarea value with lxml python

    With this Python code I can get the whole html source: import mechanize import lxml.html import StringIO br = mechanize.Browser() br.set_handle_robots(False) br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13)…
    Dark Cyber
    • 2,181
    • 7
    • 44
    • 68
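
Once the page source is available as a string, a textarea's content can be read with lxml.html roughly as follows. The form and the field name `notes` are hypothetical; `.value` is the property lxml.html exposes on textarea elements.

```python
from lxml import html

# Hypothetical form; in practice the string would be the fetched page source.
form = html.fromstring(
    '<form><textarea name="notes">hello world</textarea></form>'
)

# Locate the textarea by its name attribute.
textarea = form.xpath('//textarea[@name="notes"]')[0]

# lxml.html textarea elements expose their content through .value.
content = textarea.value
```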
    2
    votes
    1 answer

    Using lxml to Validate HTML

    I am trying to use lxml to validate a piece of HTML but it complains that the fragment is invalid even though it should be valid: img = """""" parser =…
    Alex Rothberg
    • 10,243
    • 13
    • 60
    • 120