
I've tried different ways to scrape Answer1 and Answer2 from a website with BeautifulSoup, urllib, and Selenium, but without success. Here's a simplified version of the HTML:

<div class="div1">
  <p class="p1"></p>
  <p class="p2">
    <span>Question1</span>
    <strong>Answer1</strong>
    <br>
    <span>Question2</span>
    <strong>Answer2</strong>
    <br>

In Selenium, I try to find Question1, then go to its parent and scrape Answer1. Below is the code I use, although it doesn't work:

browser.find_elements_by_xpath("//span[contains(text(), 'Question1')]/parent::p/following::strong")

I believe bs is more efficient than selenium in this case. How would you do this in bs? Thanks!

Edit: @Juan's solution is perfect for my example. However, I realized it's inapplicable to the website https://finance.yahoo.com/quote/AAPL?p=AAPL . Can anyone shed some light on parsing Consumer Goods and Electronic Equipment from there? And would it be better to use urllib.request instead? Thank you.

  • Your question seems inconsistent: you say you want to `try to find Question1, then go to its parent and scrape Answer1`, but your code attempt does the reverse: `"//span[contains(text(), 'Question1')]/parent::p/following::strong"` – undetected Selenium Mar 21 '18 at 08:34
  • @DebanjanB My end goal is to scrape `Answer1` and `Answer2`, and I think the most reliable way to scrape both correctly is to anchor on `Question1`. Therefore, I search for `Question1`, then go back to its parent and find both answers. My logic should be correct, but I'm not sure about my code (see the sketch after these comments). – Karma Mar 21 '18 at 20:48
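
For reference, the "anchor on the question" logic can be expressed with a sibling axis instead of `parent::p/following::strong` (the `following` axis excludes descendants of the paragraph, so that expression matches nothing here). A minimal sketch, untested against the real page and using the same old-style Selenium API as the attempt above:

# Each answer is the first <strong> sibling after its question <span>
answer1 = browser.find_element_by_xpath(
    "//span[contains(text(), 'Question1')]/following-sibling::strong[1]")
answer2 = browser.find_element_by_xpath(
    "//span[contains(text(), 'Question2')]/following-sibling::strong[1]")
print(answer1.text, answer2.text)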

2 Answers


This is how I would do it. I modified your HTML, closing the `p` and `div` tags:

from bs4 import BeautifulSoup as BS
html = """
<div class="div1">
  <p class="p1"></p>
  <p class="p2">
    <span>Question1</span>
    <strong>Answer1</strong>
    <br>
    <span>Question2</span>
    <strong>Answer2</strong>
    <br>
    </p>
</div>
"""
soup = BS(html, 'lxml')
# pair each question <span> with the <strong> answer that follows it
QA = {x.text: y.text for x, y in zip(soup.select('span'), soup.select('strong'))}
print(QA)
  • Thanks Juan, but it returns `{}`. Any idea? – Karma Mar 20 '18 at 23:20
  • On my computer it returns: `{'Question1': 'Answer1', 'Question2': 'Answer2'}` – jjsantoso Mar 20 '18 at 23:32
  • Sorry, you're right. I was just trying on the website and it returned nothing. – Karma Mar 20 '18 at 23:34
  • You need to pass the HTML of the website. I use requests to get it: `import requests`; `url = 'http://www.example.com'`; `html = requests.get(url).text`; and then parse with BS. If the page has other span or strong tags, then you have to delimit. In this case it could be: `QA = {x.text:y.text for x,y in zip(soup.select_one('div[class="div1"]').select('span'),soup.select_one('div[class="div1"]').select('strong'))}` – jjsantoso Mar 21 '18 at 00:03
  • It returns `AttributeError: 'NoneType' object has no attribute 'select'`, so it seems there's no such class on the website, according to this other thread? https://stackoverflow.com/questions/8949252/python-attribute-error-nonetype-object-has-no-attribute-something/8949265#8949265 – Karma Mar 21 '18 at 00:20
  • The page you want to scrape is a little more complex than simple static HTML. You have to be very specific in identifying where the information you need is located. – jjsantoso Mar 21 '18 at 14:41
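
As for the Yahoo Finance edit: a minimal sketch of that "be specific" approach, with two loud assumptions. It assumes the Sector/Industry labels sit in plain `<span>` tags with the value in the next `<span>`, which is a guess about the markup, and since much of the page is rendered client-side, a plain requests fetch may not contain these values at all (Selenium would then be the fallback):

import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL?p=AAPL'
headers = {'User-Agent': 'Mozilla/5.0'}  # Yahoo may reject the default user agent
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

for label in ('Sector', 'Industry'):
    tag = soup.find('span', string=label)           # assumed markup: <span>Sector</span>
    value = tag.find_next('span') if tag else None  # assumed: value in the next <span>
    print(label, '->', value.text if value else 'not found (likely rendered by JS)')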

div class="div1">

Question1 Answer1
Question2 Answer2

You only have to import and do it that with requests and beautifulsoup

import requests
from bs4 import BeautifulSoup

url = 'http://www.google.com'  # placeholder: use the page you actually want to scrape
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

for link in soup.find_all('span'):      # the questions
    print(link.text)
for answer in soup.find_all('strong'):  # the answers
    print(answer.text)

And that, my friend, is how you can do it; zip the two lists into tuples if you need question/answer pairs.
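
For completeness, a hedged sketch of that pairing step, assuming each `<span>` question is followed by exactly one `<strong>` answer as in the markup above:

# Pair questions with answers (reuses soup from the snippet above)
QA = dict(zip((q.text for q in soup.find_all('span')),
              (a.text for a in soup.find_all('strong'))))
print(QA)  # e.g. {'Question1': 'Answer1', 'Question2': 'Answer2'}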