HTML scraping with Python and XPath

Question

I am trying to understand how to use lxml to pull text from a page, starting with a simple Python program:

from lxml import html
import requests
page = requests.get('http://www.foo bar')
tree = html.fromstring(page.content)
name = tree.xpath('//*[@id="yui_3_17_2_1_1487276887950_2408"]/div[@class="locu-menu-item-name"]/text')
print(name)

This prints [].

The browser inspector gives this XPath for the nested tag whose value I want: //*[@id="yui_3_17_2_1_1487276887950_103789"]/div[1]/div[1]

The element is <div class="locu-menu-item-name">Italian Lemon Sorbetto</div>, which is nested like this:

<div class="menu-item-inner">                      
    <div class="locu-menu-item-name">Italian Lemon Sorbetto</div>
    <div class="locu-menu-item-description">Dairy-free</div>
    <div class="option-wrapper"></div>
    <div class="locu-menu-item-price"></div>
</div>

Any help would be great.

You'll probably find it easier to find references and help if you just used css selectors instead. — pvg, Feb 17 '17 at 22:56

Oleksandr Dashkov · Answer 1 · 2017-02-17T23:38:03.880


You have an error in your XPath. To get the text, you should use /text() at the end, not /text. So your XPath should look like this:

name = tree.xpath('//*[@id="yui_3_17_2_1_1487276887950_2408"]/div[@class="locu-menu-item-name"]/text()')

With /text, XPath looks for a child element named text, so it would only match markup like this:

<div class="locu-menu-item-name"><text>Italian Lemon Sorbetto</text></div>
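
The difference between the two expressions can be checked without any network access. The following is a minimal sketch, assuming lxml is installed; the markup is copied from the question, and the variable names are only illustrative.

```python
from lxml import html

# Markup copied from the question; nothing is downloaded here.
snippet = """
<div class="menu-item-inner">
    <div class="locu-menu-item-name">Italian Lemon Sorbetto</div>
    <div class="locu-menu-item-description">Dairy-free</div>
    <div class="option-wrapper"></div>
    <div class="locu-menu-item-price"></div>
</div>
"""

tree = html.fromstring(snippet)

# /text asks for a child element named <text>; the div has none.
as_element = tree.xpath('//div[@class="locu-menu-item-name"]/text')

# /text() asks for the text nodes of the matched div.
as_text_node = tree.xpath('//div[@class="locu-menu-item-name"]/text()')

print(as_element)    # []
print(as_text_node)  # ['Italian Lemon Sorbetto']
```

Note that xpath() always returns a list, so an empty list means "no match" rather than an error.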

edited Feb 17 '17 at 23:38

answered Feb 17 '17 at 23:13


I understand the error in my XPath, but I had changed it from what the inspector showed. When I copy the XPath of this element it is: //*[@id="locu-medium-container"]/div[1]/div/div[1]/div[2]/div[1]/div[1]/div[1]. However, using either my corrected /text() or the unaltered XPath still gives an empty result. I am really at a loss as to where to pull this from. – Chuck LaPress Feb 18 '17 at 17:37
@ChuckLaPress Could you provide the real URL? When I run tree.xpath('//div[@class="locu-menu-item-name"]/text()') against the HTML that you added, I get the text. – Oleksandr Dashkov Feb 18 '17 at 17:53
@ChuckLaPress So it's a problem with the page load. When you make the call using requests, the element is not in the response. The elements are probably added by JavaScript after the page loads. You can use something like Selenium to get this data. – Oleksandr Dashkov Feb 18 '17 at 18:14
Thanks, I understand that; however, I have very little experience. Thank you for clarifying that the problem is that the element hasn't loaded when the code tries to pull it. – Chuck LaPress Feb 18 '17 at 18:19
One other question: I have written some code that is resulting in : using Selenium. Could you possibly help me get the answer by reviewing what I have written? – Chuck LaPress Feb 18 '17 at 19:04
@ChuckLaPress I can help if you have a problem. But first check http://stackoverflow.com/questions/7781792/selenium-waitforelement to wait for your elements, and then use http://selenium-python.readthedocs.io/locating-elements.html to get your text. – Oleksandr Dashkov Feb 18 '17 at 19:12
Thank you for the links. I am thankful for your help already, and the links will get me there. Thanks again. – Chuck LaPress Feb 18 '17 at 19:21
I will post the code with its output; I am still not getting the desired string for locu-item-name. It will be in a new topic named python selenium scraping. – Chuck LaPress Feb 19 '17 at 02:50
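
One way to confirm the diagnosis from the comments (the element is added by JavaScript after the page loads) is to count XPath matches in the raw HTML that requests receives. This is only a sketch, assuming lxml is installed: the helper name count_matches is hypothetical, and both sample HTML strings are made up to stand in for a server response and a browser-rendered page.

```python
from lxml import html

def count_matches(raw_html, xpath):
    """Return how many nodes `xpath` selects in the given HTML text."""
    return len(html.fromstring(raw_html).xpath(xpath))

XPATH = '//div[@class="locu-menu-item-name"]/text()'

# Made-up stand-in for what the server sends: an empty container
# that a script would fill in later.
static_html = (
    '<html><body><div id="locu-medium-container"></div></body></html>'
)

# Made-up stand-in for the DOM after the script has run.
rendered_html = (
    '<html><body><div id="locu-medium-container">'
    '<div class="locu-menu-item-name">Italian Lemon Sorbetto</div>'
    '</div></body></html>'
)

print(count_matches(static_html, XPATH))    # 0
print(count_matches(rendered_html, XPATH))  # 1
```

With a real page, the same helper could be called on requests.get(url).text. If it returns 0 while the browser inspector shows the element, the content is most likely rendered client-side, and a browser-driven tool such as Selenium is needed.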
