Questions tagged [lxml.html]

lxml.html is a dedicated Python package for dealing with HTML. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
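
As a rough illustration of the Element API and utilities mentioned in the tag description, here is a minimal sketch. The HTML fragment, the variable names, and the `reviews` id are made up for this example; only `fromstring`, `xpath`, `text_content`, `getparent`, `find`, and `drop_tree` are lxml calls.

```python
from lxml import html

# Hypothetical HTML fragment, invented purely for illustration.
SAMPLE = """
<div id="reviews">
  <h4>Genres:</h4>
  <a href="/genre/sci-fi">Sci-Fi</a>
  <a href="/genre/drama">Drama</a>
  <p>Some text <img src="poster.png"> after the image</p>
</div>
"""

# Parse the fragment; with a single root element, fromstring returns it.
doc = html.fromstring(SAMPLE)

# XPath: select the anchors that are following siblings of the <h4> label.
genres = [a.text_content()
          for a in doc.xpath('.//h4[text()="Genres:"]/following-sibling::a')]

# Walk from an element back up to its parent and read the parent's id.
heading = doc.xpath('.//h4')[0]
parent_id = heading.getparent().get('id')

# drop_tree() removes an element but keeps its tail text in place,
# whereas parent.remove(child) would discard the tail text as well.
img = doc.find('.//img')
img.drop_tree()
paragraph_text = doc.find('.//p').text_content()

print(genres)
print(parent_id)
print(paragraph_text)
```

Several of the questions listed below touch on the same calls (sibling and parent traversal with XPath, removing an `img` element), so the sketch may serve as a starting point rather than a definitive answer to any of them.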

159 questions
2
votes
1 answer

Scraping IMDb Review Page with lxml and requests package

I want to extract the user reviews of a particular movie with the help of lxml. Before that, I need to find out the number of reviews. An example review page is Interstellar. I found the XPath where the User Reviews are located with the help of Firebug:…
GokuShanth
  • 203
  • 3
  • 12
2
votes
3 answers

Output of lxml in Python 2.7

This might be a completely foolish question, but Google has been of no help. First, of course, I import the libraries I need: from lxml import html from lxml import etree import requests Simple enough. Now to run and parse some code. The link in this…
Ruhpun
  • 25
  • 3
2
votes
3 answers

Python - Requests: Correctly Using Params?

Before I begin, may I just say, I am very new to general communication with the web in code. With that said, could anyone assist me in getting these parameters, 'a': stMonth, 'b': stDate, 'c': stYear, 'd': enMonth, …
The Novice
  • 124
  • 9
2
votes
1 answer

Getting parent tag id with lxml

I am trying to scrape a dummy site and get the parent tag of the one that I am searching for. Here's the structure of the code I am searching for:
Here's my Python…
user2157179
  • 238
  • 2
  • 4
  • 19
2
votes
1 answer

Duplicates when extracting data from html table using lxml.html.xpath()

I am trying to extract data from this table at ESPNcricinfo. Each row has the following format (data replaced by headers): Player Name (Country) Score …
2
votes
1 answer

Removing img tag in lxml

I have this code: from lxml.html import fromstring, tostring html = "

Here is some text

" doc = fromstring(html) img = doc.find('.//img') doc.remove(img) print tostring(doc) And the output is:

Why does…
rmacqueen
  • 971
  • 2
  • 8
  • 22
2
votes
2 answers

Traversing back to parent with lxml.html.xpath

How can we traverse back to the parent in XPath? I am crawling IMDb to obtain the genres of films, using elem = hxs.xpath('//*[@id="titleStoryLine"]/div/h4[text()="Genres:"]') Now, the genres are listed as anchor links, which are siblings to this tag.…
Amrith Krishna
  • 2,768
  • 3
  • 31
  • 65
2
votes
1 answer

python - parse html form with lxml.html with xpath syntax

Here is the form. The same exact form appears twice in the source.
user3196332
  • 361
  • 2
  • 4
  • 11
2
votes
1 answer

Python - lxml library 'clean' method erasing only half of empty <li> node

I'm using the lxml library in Python to clean html pages from potentially harmful code/parts I don't want. I noticed a strange behavior in the function: when given an empty <li> node, it removes the closing </li> tag but not the opening one. For…
Robin
  • 9,415
  • 3
  • 34
  • 45
1
vote
1 answer

Best XPath practices for extracting data from a field that varies in format

I was using Python 3.8, XPath and Scrapy where things just seemed to work. I took my XPath expressions for granted. Now I'm using Python 3.8, XPath and lxml.html and things are much less forgiving. For example, using this URL and this…
spacedog
  • 446
  • 3
  • 13
1
vote
2 answers

Why does python requests.get() retrieve different image src compared to browsing the site

As the title suggests: calling the requests.get() method gives me a different image src link as opposed to when browsing the site manually. I'm trying to scrape a site for products and want to store the images but the src I get from the site is for a…
Marco Fernandes
  • 326
  • 1
  • 4
  • 13
1
vote
1 answer

How to get text from HTML element by using lxml.html

I've been trying to get the full text hosted inside a <div> element from the web page https://www.list-org.com/company/11665809. The element should contain a sub-string "Арбитраж". And it does, because my code for div in…
Sergey Solod
  • 695
  • 7
  • 15
1
vote
2 answers

Scraping a nested and unstructured table in python (lxml)

Scraping the website (using lxml) works just fine for everything except a table, in which all the tr's, td's and heading th's are nested and mixed, forming an unstructured HTML table.
Serial No. …
Mukul Kumar Jha
  • 1,062
  • 7
  • 19
1
vote
2 answers

Python scraping's trouble in extract value

I'm trying to extract values from the table in this site: https://www.geonames.org/search.html?q=&country=IT In my example I want to extract the name 'Rome' and I used this code: import requests import lxml.html html =…
gergiu
  • 11
  • 1
1
vote
1 answer

Compare string result from path & requests

I am scraping the HTML code from the defined URL, mainly focusing on the <script> tag, to extract its contents. Then I check whether the string "example" exists in the script; if yes, print something and set flag = 1. I am not able to compare the results extracted…
tehais
  • 33
  • 6