
I am having some trouble using XPath to retrieve text from an HTML page with the lxml library.

The page URL is www.mangapanda.com/one-piece/1/1

I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty simple:

.//*[@id='chapterMenu']/option[1]/text()

I verified the above using FirePath and it gives the correct data, but when I try to use lxml for this purpose I get no data at all.

from lxml import html
import requests

r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)

name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")

But nothing is stored in name. I even tried other XPaths, like:

//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()

The above were also verified using FirePath. I am unable to figure out what the problem could be. I would appreciate some assistance with this.

It is not that nothing works, though. One XPath that does work with lxml here is:

.//img[@id='img']/@src

Thank you.

Psycho_Coder
2 Answers


I've had a look at the HTML source of that page, and the content of the element with the id chapterMenu is empty. I think your problem is that it is filled using JavaScript, and JavaScript will not be evaluated automatically just by reading the HTML with lxml.html.

You might want to have a look at this: Evaluate javascript on a local html file (without browser)
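
If you do go down that route, here is a minimal sketch of one way to render the page first, using PyQt5's QtWebEngine (the successor to the QtWebKit approach mentioned in the comments). The Render class and its callback wiring are illustrative, not taken from the linked post.

import sys

from lxml import html
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

app = QApplication(sys.argv)

class Render(QWebEnginePage):
    # Loads a URL, lets its JavaScript run, then stores the rendered HTML.
    def __init__(self, url):
        super().__init__()
        self.rendered_html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        app.exec_()  # blocks until _store_html() calls quit()

    def _on_load_finished(self, ok):
        # toHtml() is asynchronous and delivers the result via a callback.
        self.toHtml(self._store_html)

    def _store_html(self, html_text):
        self.rendered_html = html_text
        app.quit()

renderer = Render("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(renderer.rendered_html)
print(page.xpath(".//*[@id='chapterMenu']/option[1]/text()"))

As noted in the comments below, this works but is noticeably slower than plain requests, since it spins up a full browser engine.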

Maybe you're able to trick it, though. In the end, the JavaScript also has to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919

The response is JSON and can easily be turned into a Python dict/list using the json library. But you have to find out how to get the id and which parameters if you want to automate this.

The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter has to be 0; when it is 0 you will be redirected to the proper URL.
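
For illustration, here is a rough sketch of that approach in Python. The variable names are mine, and it assumes the response really is JSON and that document['mangaid'] shows up in one of the script tags as described; adjust the parsing if the site differs.

import re
import requests

# Sketch of the selector trick described above.
chapter_url = "http://www.mangapanda.com/one-piece/1/1"
page_source = requests.get(chapter_url).text

# Pull the manga id out of one of the script tags (document['mangaid'] = ...;).
manga_id = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page_source).group(1)

# which=0 redirects to the proper url, as noted above.
resp = requests.get("http://www.mangapanda.com/actions/selector/",
                    params={"id": manga_id, "which": 0})
chapters = resp.json()  # assumes a JSON response, as described
print(chapters)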

So there you go ;)

swenzel
  • Thanks, I think this will work. I found out that we can even use QtWebKit, which renders JS code. I tried it, and the HTML data I received contained all the chapter names. The only problem is that it takes more time to load. – Psycho_Coder Mar 12 '15 at 18:27

The source document of the page you are requesting is in a default namespace:

<html xmlns="http://www.w3.org/1999/xhtml">

even if FirePath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.

name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
   namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})

Then, the piece of the document the path expression above is concerned with is:

<select id="chapterMenu" name="chapterMenu"></select>

As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.

Mathias Müller
  • I want to get the chapter name which we can see in the dropdown list. That is, for the link I provided it will be "Chapter 1: Romance Dawn". – Psycho_Coder Mar 12 '15 at 17:42
  • @Psycho_Coder Then, swenzel is quite right. This content is _not_ present in the source document. You can try this yourself by just opening the source and searching for e.g. "Romance". – Mathias Müller Mar 12 '15 at 17:46
  • Yes, I saw that and @swenzel is quite correct. I wonder what I can do to get the chapter names. I am making a MangaScrapper and I wanted it to be cool and simple. I thought there would be better ways to do this than using a webdriver. One thing I can do to meet my need is to get each manga's home URL from [this](http://www.mangapanda.com/alphabetical) page and extract the chapter URLs from there. But that would require me to code extra things. – Psycho_Coder Mar 12 '15 at 17:55
  • I've updated my answer... it might solve your problem now if you don't run into further JS problems :D – swenzel Mar 12 '15 at 18:11