
I've tried different ways to scrape Answer1 and Answer2 from a website with BeautifulSoup, urllib, and Selenium, but without success. Here's a simplified version of the HTML:

<div class="div1">
  <p class="p1"></p>
  <p class="p2">
    <span>Question1</span>
    <strong>Answer1</strong>
    <br>
    <span>Question2</span>
    <strong>Answer2</strong>
    <br>

In Selenium, I try to find Question1, then go to its parent and scrape Answer1. Below is the code I use, although it doesn't work:

browser.find_elements_by_xpath("//span[contains(text(), 'Question1')]/parent::p/following::strong")

I believe bs is more efficient than selenium in this case. How would you do this in bs? Thanks!

Edit: @Juan's solution is perfect for my example. However, I realized it's inapplicable to the website https://finance.yahoo.com/quote/AAPL?p=AAPL . Can anyone shed some light on parsing Consumer Goods and Electronic Equipment from there? And would it be better to use urllib.request instead? Thank you.

  • Your question seems inconsistent: you say you want to `try to find Question1, then go to its parent and scrape Answer1`, but your code attempt does the reverse: `"//span[contains(text(), 'Question1')]/parent::p/following::strong"` – undetected Selenium Mar 21 '18 at 08:34
  • @DebanjanB My end goal is to scrape `Answer1` and `Answer2`, and I think the most reliable way to scrape both correctly is to anchor on `Question1`. Therefore, I search for `Question1`, then go back to its parent and find both answers. My logic should be correct, but I'm not sure about my code (see the sketch after these comments). – Karma Mar 21 '18 at 20:48
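
For reference, the "anchor on the question" logic can be expressed with a sibling axis instead of `parent::p/following::strong` (the `following` axis excludes descendants of the paragraph, so that expression matches nothing here). A minimal sketch, untested against the real page and using the same old-style Selenium API as the attempt above:

# Each answer is the first <strong> sibling after its question <span>
answer1 = browser.find_element_by_xpath(
    "//span[contains(text(), 'Question1')]/following-sibling::strong[1]")
answer2 = browser.find_element_by_xpath(
    "//span[contains(text(), 'Question2')]/following-sibling::strong[1]")
print(answer1.text, answer2.text)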

2 Answers


This is how I would do it. I modified your HTML, closing the `p` and `div` tags:

from bs4 import BeautifulSoup as BS
html = """
<div class="div1">
  <p class="p1"></p>
  <p class="p2">
    <span>Question1</span>
    <strong>Answer1</strong>
    <br>
    <span>Question2</span>
    <strong>Answer2</strong>
    <br>
    </p>
</div>
"""
soup = BS(html, 'lxml')
# pair each question <span> with the <strong> answer that follows it
QA = {x.text: y.text for x, y in zip(soup.select('span'), soup.select('strong'))}
print(QA)
  • Thanks Juan, but it returns `{}`. Any idea? – Karma Mar 20 '18 at 23:20
  • On my computer it returns: `{'Question1': 'Answer1', 'Question2': 'Answer2'}` – jjsantoso Mar 20 '18 at 23:32
  • Sorry, you're right. I was just trying on the website and it returned nothing. – Karma Mar 20 '18 at 23:34
  • You need to pass the HTML of the website. I use requests to get it: `import requests`; `url = 'http://www.example.com'`; `html = requests.get(url).text`; and then parse with BS. If the page has other span or strong tags, then you have to delimit. In this case it could be: `QA = {x.text:y.text for x,y in zip(soup.select_one('div[class="div1"]').select('span'),soup.select_one('div[class="div1"]').select('strong'))}` – jjsantoso Mar 21 '18 at 00:03
  • It returns `AttributeError: 'NoneType' object has no attribute 'select'`, so it seems there's no such class on the website, according to this other thread? https://stackoverflow.com/questions/8949252/python-attribute-error-nonetype-object-has-no-attribute-something/8949265#8949265 – Karma Mar 21 '18 at 00:20
  • The page you want to scrape is a little more complex than simple static HTML. You have to be very specific in identifying where the information you need is located. – jjsantoso Mar 21 '18 at 14:41
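
As for the Yahoo Finance edit: a minimal sketch of that "be specific" approach, with two loud assumptions. It assumes the Sector/Industry labels sit in plain `<span>` tags with the value in the next `<span>`, which is a guess about the markup, and since much of the page is rendered client-side, a plain requests fetch may not contain these values at all (Selenium would then be the fallback):

import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL?p=AAPL'
headers = {'User-Agent': 'Mozilla/5.0'}  # Yahoo may reject the default user agent
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

for label in ('Sector', 'Industry'):
    tag = soup.find('span', string=label)           # assumed markup: <span>Sector</span>
    value = tag.find_next('span') if tag else None  # assumed: value in the next <span>
    print(label, '->', value.text if value else 'not found (likely rendered by JS)')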

div class="div1">

Question1 Answer1
Question2 Answer2

You only have to import and do it that with requests and beautifulsoup

import requests
from bs4 import BeautifulSoup

url = 'http://www.google.com'  # placeholder: use the page you actually want to scrape
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

for link in soup.find_all('span'):      # the questions
    print(link.text)
for answer in soup.find_all('strong'):  # the answers
    print(answer.text)

And that, my friend, is how you can do it; zip the two lists into tuples if you need question/answer pairs.
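
For completeness, a hedged sketch of that pairing step, assuming each `<span>` question is followed by exactly one `<strong>` answer as in the markup above:

# Pair questions with answers (reuses soup from the snippet above)
QA = dict(zip((q.text for q in soup.find_all('span')),
              (a.text for a in soup.find_all('strong'))))
print(QA)  # e.g. {'Question1': 'Answer1', 'Question2': 'Answer2'}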