
I am trying to scrape an image from a marketplace, but I think that the strange class tags are getting in the way. This is the piece of HTML that I am trying to scrape:

HTML

When I run this snippet:

import requests
from bs4 import BeautifulSoup
url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'

headers = {'User-Agent': 'whatever'}
response = requests.get(url, headers=headers)
html = response.content
bs = BeautifulSoup(html, "lxml")
bs.find('div', class_='src__Wrapper-xr9q25-1 fwzdjF')

I get this result: <div class="src__Wrapper-xr9q25-1 fwzdjF"></div>. The div is empty, so there is nothing inside it to scrape.

If I try to find the picture tag instead, nothing is returned:

>>> bs.find('picture', class_="src__Picture-xr9q25-2 gKwsnn")

Does someone have a clue on what to do here?

dsenese
  • 1
    FYI ‘to scrap’ means to throw away. The correct word for what you’re doing is ‘scrape’. – DisappointedByUnaccountableMod Mar 09 '21 at 19:35
  • 1
    The picture is probably loaded dynamically by JavaScript. That's why it's not there when you scrape - the page JS has not run. – forgetso Mar 09 '21 at 19:38
  • Thank you for correcting, @barny. – dsenese Mar 09 '21 at 19:50
  • Is there a way to load the JS to get the image? @forgetso – dsenese Mar 09 '21 at 19:50
  • 1
    Yes, you could use Selenium and ChromeDriver to render the page. Alternatively, you can identify the calls the JS is making by looking at the network tab in the browser inspector and by reading the page JS, then replicate those calls in Python. Hard to be clearer as I cannot access the website - denied. – forgetso Mar 09 '21 at 19:54
  • @forgetso Sorry about that, I think that it's blocked for people outside of Brazil. I'll try to use Selenium and ChromeDriver to achieve that. One more question: is the requests library still needed in that case? – dsenese Mar 09 '21 at 20:05
  • 1
    No, you don't need requests if you're using ChromeDriver - it's a headless browser. [Source](https://stackoverflow.com/questions/52217866/web-scraping-using-selenium-and-bs4) – forgetso Mar 09 '21 at 20:08
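
Before reaching for a browser driver, it can be worth checking whether the data is already present somewhere in the raw HTML, as suggested above. The following is a minimal offline sketch of that check using only the standard library. SAMPLE_HTML, the window.__STATE__ variable name, and the example.com URL are made-up stand-ins, not the real page contents; on the real site you would pass in response.text instead.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the raw HTML returned by requests:
# the wrapper div is empty, but a <script> tag carries the data as JSON.
SAMPLE_HTML = (
    '<html><body>'
    '<div class="src__Wrapper-xr9q25-1 fwzdjF"></div>'
    '<script>window.__STATE__ = '
    '{"images": [{"extraLarge": "https://example.com/1.jpg"}]}</script>'
    '</body></html>'
)


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_script:
            self.scripts[-1] += data


def scripts_mentioning(html, needle):
    """Return the bodies of all <script> tags that contain `needle`."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [body for body in collector.scripts if needle in body]


hits = scripts_mentioning(SAMPLE_HTML, ".jpg")
print(len(hits))
```

If a search like this finds the image URLs inside a script tag, the data can probably be parsed directly and no JavaScript rendering is needed.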

1 Answer


The images are loaded dynamically via JavaScript, but the page data is embedded in the HTML as JSON assigned to window.__APOLLO_STATE__. You can extract it with the re and json modules:

import re
import json
import requests

url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
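# Fetch the raw HTML, then capture the JSON assigned to window.__APOLLO_STATE__
html = requests.get(url, headers=headers).text
match = re.search(r'window\.__APOLLO_STATE__ = (.*)</script>', html)
data = json.loads(match.group(1))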


def find_images(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'images':
                yield v
            else:
                yield from find_images(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_images(v)


images = next(find_images(data))

for image in images:
    print(image['extraLarge'])

Prints:

https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_1SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_2SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_3SZ.jpg
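
If the regex capture ever picks up extra text after the JSON object (for example another statement or script tag on the same line), a slightly more defensive variant is to let the JSON decoder itself decide where the object ends. This is only a sketch under assumptions: SAMPLE_HTML is a made-up, trimmed-down stand-in for the real page, and extract_state is a hypothetical helper name. find_images is the same generator as above.

```python
import json
import re

# Hypothetical stand-in for the product page; the real state object is far
# larger and its exact layout is an assumption here.
SAMPLE_HTML = (
    '<script>window.__APOLLO_STATE__ = '
    '{"ROOT_QUERY": {"product": {"images": ['
    '{"extraLarge": "https://example.com/img/1SZ.jpg"}, '
    '{"extraLarge": "https://example.com/img/2SZ.jpg"}]}}};'
    'window.other = 1;</script><script>var x = 2;</script>'
)

STATE_MARKER = re.compile(r'window\.__APOLLO_STATE__\s*=\s*')


def extract_state(html):
    """Parse the JSON object that follows the __APOLLO_STATE__ assignment."""
    match = STATE_MARKER.search(html)
    if match is None:
        raise ValueError('__APOLLO_STATE__ not found in page')
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so trailing JS does not break parsing.
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def find_images(data):
    """Recursively yield every value stored under an 'images' key."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'images':
                yield v
            else:
                yield from find_images(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_images(v)


state = extract_state(SAMPLE_HTML)
images = next(find_images(state), [])  # empty list if no 'images' key exists
urls = [image['extraLarge'] for image in images]
print(urls)
```

On the real page you would call extract_state(requests.get(url, headers=headers).text) instead of using SAMPLE_HTML.
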
Andrej Kesely
  • 1
    This worked flawlessly! Although Selenium/ChromeDriver would be a good approach, this code is rapid, worked with other products on this website and I can keep using Requests as I wanted. Thank you! – dsenese Mar 09 '21 at 20:35