Python 3: How to web scrape text from div that contains multiple class values

Question

I'm trying to scrape a website (here is the link to the website), but the div on the page has multiple class values, which makes it hard for me to scrape the data. I looked through earlier questions on Stack Overflow but could not find an answer that fit. Below is part of the HTML I extracted from the website:

<div data-reactid="118">
  <div class="ue-ga base_ ue-jk" style="margin-left:-24px;margin-bottom:;" data-reactid="119">
    <div style="display: flex; flex-direction: column; width: 100%; padding-left: 24px;" data-reactid="120">
      <div class="ue-a3 ue-ap ue-a6 ue-gb ue-ah ue-n ue-f5 ue-ec ue-gc ue-gd ue-ge ue-gf base_ ue-jv ue-gz ue-h0 ue-h1" data-reactid="121">
        <div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
          <div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
            Want to extract this part
          </div>
        </div>
      </div>
    </div>
  </div>
</div>

What I want to extract is the text "Want to extract this part". I considered selecting by data-reactid, but different pages assign different data-reactid numbers, so that wasn't a good idea. Note also that the class names are not unique.

Can anyone guide me through this? Much appreciated.

What is around that element? Is it in a table or ? Can you post a link to the page or post more of the surrounding HTML? — JeffC, Sep 06 '18 at 04:38
@JeffC I have updated the link to the website thanks for helping out — DanLee, Sep 06 '18 at 08:32

score 1 · Answer 1 · edited Sep 06 '18 at 07:00

You can use jQuery as below. The attribute value has to be quoted because it contains spaces.

$("div[title='Want to extract this part']").text();
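Since the question is about Python, here is a rough equivalent of that attribute selector using BeautifulSoup's `select_one`. It is only a sketch: the HTML is a shortened copy of the snippet in the question, and it assumes a bs4 version that supports CSS attribute selectors. Matching on the exact title only helps if the text is known beforehand, so a presence-only `div[title]` match is shown as well.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's snippet (assumption: the real page has the same shape).
html = '''
<div class="ue-a6 ue-bz">
  <div class="ue-bn ue-bo" title="Want to extract this part">
    Want to extract this part
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Exact match on the attribute value; it is quoted because it contains spaces.
exact = soup.select_one('div[title="Want to extract this part"]')

# Presence-only match: the first div that carries a title attribute at all.
any_titled = soup.select_one("div[title]")

print(exact["title"])
print(any_titled.get_text(strip=True))
```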

answered Sep 06 '18 at 01:21 by mala · edited Sep 06 '18 at 07:00 by TIGER

score 1 · Answer 2 · answered Sep 06 '18 at 01:24

If the classes always stay the same for that specific element on each page, you can target it with this selector:

.ue-bn.ue-bo.ue-cc.ue-bq.ue-g9.ue-bs

There are many other selectors you could use; which one works depends on whether it is unique and consistent across pages.
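In Python the same compound class selector can be passed to BeautifulSoup's `select`. This is a minimal sketch, not a tested scraper for the real site: the second div in the sample HTML is invented to show that an element sharing only one of the classes is not matched.

```python
from bs4 import BeautifulSoup

# Target div copied from the question; the second div is a made-up
# sibling that shares only the "ue-bn" class.
html = '''
<div class="ue-a6 ue-bz ue-gb">
  <div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part">
    Want to extract this part
  </div>
  <div class="ue-bn ue-zz">Other block sharing one class</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Chaining class names with dots requires ALL of them to be present
# on the same element, in any order.
selector = ".ue-bn.ue-bo.ue-cc.ue-bq.ue-g9.ue-bs"
matches = soup.select(selector)
texts = [m.get_text(strip=True) for m in matches]
print(texts)
```

If the site changes even one of these generated class names, the selector stops matching, which is why consistency across pages matters.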

answered Sep 06 '18 at 01:24 by itodd

Thanks for your input. Are you referring to css selector? Classes are not unique so I guess it is going to be harder. – DanLee Sep 06 '18 at 01:33
Yes a CSS selector that you can use in JS such as `document.querySelector('.ue-bn.ue-bo.ue-cc.ue-bq.ue-g9.ue-bs');` It's hard to figure out a unique selector without seeing the whole page html – itodd Sep 06 '18 at 01:39
Which parts of the web page are you trying to scrape exactly? If it is the meal names you can use this selector: `div > div[title]`. e.g. `document.querySelectorAll('div > div[title]').forEach(el => console.log(el.title))` – itodd Sep 07 '18 at 04:08

score 1 · Answer 3 · answered Sep 06 '18 at 05:43

This may help you:

from bs4 import BeautifulSoup
html = """<div data-reactid="118">
<div class="ue-ga base_ ue-jk" style="margin-left:-24px;margin-bottom:;" data-reactid="119">
<div style="display: flex; flex-direction: column; width: 100%; padding-left: 24px;" data-reactid="120">
  <div class="ue-a3 ue-ap ue-a6 ue-gb ue-ah ue-n ue-f5 ue-ec ue-gc ue-gd ue-ge ue-gf base_ ue-jv ue-gz ue-h0 ue-h1" data-reactid="121">
    <div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
      <div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
        Want to extract this part
      </div>
    </div>
  </div>
</div>
</div>
</div>"""

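# Parse the snippet with Python's built-in HTML parser.
soup = BeautifulSoup(html, 'html.parser')
# Matching on one class value is enough; the tag may carry other classes as well.
tag = soup.find('div', attrs={'class': 'ue-bn'})
# Join the tag's text fragments with the surrounding whitespace stripped.
text = ''.join(tag.stripped_strings)
print(text)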
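One thing worth knowing when a div has several class values: BeautifulSoup treats `class` as a multi-valued attribute. The sketch below uses a one-line sample div, shortened from the question, to compare a single-class lookup, an exact-string lookup, a reordered string, and a CSS selector. The variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# One-line sample shortened from the question's target div.
html = ('<div class="ue-bn ue-bo ue-cc" title="Want to extract this part">'
        'Want to extract this part</div>')
soup = BeautifulSoup(html, "html.parser")

# A single class value matches even though the tag has other classes too.
by_one = soup.find("div", class_="ue-bn")

# A space-separated string is compared against the exact attribute value,
# so the same classes in a different order do not match.
by_exact = soup.find("div", class_="ue-bn ue-bo ue-cc")
by_reordered = soup.find("div", class_="ue-bo ue-bn")

# A CSS selector requires all listed classes, regardless of order.
by_selector = soup.select_one("div.ue-bo.ue-bn")

print(by_one is not None, by_exact is not None,
      by_reordered is None, by_selector is not None)
```

So if you need to require two or more classes at once, the CSS selector form is usually the safer choice.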

Sers · Answer 4 · 2018-09-06T11:56:49.607

Menus:

- all menus to use in loop, css selector: div.base_ h3
- menu by name, xpath: //div[contains(@class,'base_')]//h3[.='Big Mac® Bundles']

Food Cards

- titles, css selector: div[title]
- titles, xpath: //div[./div[@title]]/div[@title]
- prices, xpath: //div[./div[@title]]//span
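
If you want to loop:

from selenium.webdriver.common.by import By

# Each card is a div that has a direct child div with a title attribute.
# find_element(By...) is used because the find_element_by_* helpers were removed in newer Selenium releases.
cards = driver.find_elements(By.XPATH, "//div[./div[@title]]")
for card in cards:
    title = card.find_element(By.CSS_SELECTOR, "div[title]")
    price = card.find_element(By.CSS_SELECTOR, "span")
    # or using xpath
    # title = card.find_element(By.XPATH, "./div[@title]")
    # price = card.find_element(By.XPATH, ".//span")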

Category menu:

- all categories, css selector: a[href*='category']
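The card loop in this answer can also be tried offline, without a browser. Here is a sketch that swaps Selenium for BeautifulSoup on static HTML. The card markup is entirely hypothetical (the real page structure may differ), and it assumes each price sits in a `span` somewhere under the same parent as the titled div.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup, invented for illustration only.
html = '''
<div class="menu">
  <div class="card">
    <div title="Burger">Burger</div>
    <div><span>$5.00</span></div>
  </div>
  <div class="card">
    <div title="Fries">Fries</div>
    <div><span>$2.50</span></div>
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

items = []
# A card is taken to be the parent of any div carrying a title attribute.
for title_div in soup.select("div[title]"):
    card = title_div.parent
    price = card.find("span")  # first span anywhere inside the card
    items.append((title_div["title"],
                  price.get_text(strip=True) if price else None))

print(items)
```

On a React page the HTML usually has to come from a rendered source (for example `driver.page_source` after waiting), since the raw response may not contain these elements yet.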

score 0 · Answer 5 · answered Sep 06 '18 at 07:20

Based on the HTML you have shared, the element is a React element, so to extract the text "Want to extract this part" you have to use WebDriverWait until the element is visible. You can then use either of the following solutions:

Using title attribute:

myText = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.base_ div[title]"))).get_attribute("title")

Using innerHTML:

myText = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.base_ div[title]"))).get_attribute("innerHTML")

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Why did you change your name? and why are you using the new contributor icon as your icon? Don't you think that might be misleading to others? — JeffC, Sep 06 '18 at 12:50
