If you go to the site, you'll notice an age confirmation window, which I wanted to bypass with Scrapy. I couldn't get that to work, so I moved on to Selenium WebDriver, and now I'm using

driver.find_element_by_xpath('xpath').click()

to bypass that age confirmation window. Honestly, I don't want to use Selenium WebDriver because it is slow. Is there any way to bypass that window? I searched Stack Overflow and Google extensively but didn't find an answer that resolves my problem. If you have a link or an idea for solving this with Scrapy, that would be appreciated. A single helpful comment will be up-voted!

Community

2 Answers

The age verification "window" is just a div that gets hidden when you press the button, not a real separate window:

<div class="age-check-modal" id="age-check-modal">

You can use the browser's Network tab in developer tools to see that no new info is uploaded or sent when you press the button. So everything is already loaded when you request a page. The "popup" is not even a popup, just an element whose display is changed to none when you click the button.

So Scrapy doesn't care what's meant to be displayed, as long as all the HTML is loaded. If the elements are loaded, they are accessible. Or have you seen some information being unavailable without pressing the button?
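To illustrate the point that a hidden or overlaid element does not block extraction, here is a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, and the markup (the `content` div, the button, and the product link) is hypothetical, made up only for this example. Only the `age-check-modal` div comes from the page discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: the modal overlay sits next to the
# real content in the same document. Nothing is "behind" it in the HTML.
PAGE = """
<html><body>
  <div class="age-check-modal" id="age-check-modal">
    <button id="confirm">I am over 18</button>
  </div>
  <div id="content">
    <a href="/products/example">Example product</a>
  </div>
</body></html>
"""


def extract_links(markup):
    """Return the href of every <a> directly inside the #content div."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//*[@id='content']/a")]


# The modal is never "clicked", yet the link is still found.
links = extract_links(PAGE)
print(links)
```

A parser only sees the markup that was downloaded. Whether the browser would draw an overlay on top of it makes no difference.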

You should inspect the HTML more closely to see what each website does. This might make your scraping tasks easier.

Edit: After inspecting the original html you can see the following:

<div class="products-list">
    <div class="products-container-block">
      <div class="products-container">
        <div id="hits" class='row'>
        </div>
      </div>
    </div>
  </div>

You can also see a lot of JS script tags.
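The empty `hits` container is why an XPath such as `//*[@id="hits"]/a/@href` returns nothing on the raw response. A small sketch of that check follows, again using the standard-library `xml.etree.ElementTree` as a stand-in for Scrapy's selectors. The markup is the snippet quoted above, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The container exactly as it appears in the downloaded HTML,
# before any JavaScript has run.
RAW = """<div class="products-list">
    <div class="products-container-block">
      <div class="products-container">
        <div id="hits" class='row'>
        </div>
      </div>
    </div>
  </div>"""

root = ET.fromstring(RAW)

# The container itself is present in the response...
hits = root.find(".//*[@id='hits']")

# ...but it has no <a> children yet, so there are no hrefs to extract.
hrefs = [a.get("href") for a in hits.findall("a")]
print(hits is not None, hrefs)
```

So the selector is fine. The links simply do not exist in the response that Scrapy receives.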

The browser element inspector shows us the following: (screenshot not available)

The ::before part gives away that this was manipulated by JS, as you cannot do this with simple CSS. See Granitosaurus' answer for details on this.

What this means is that you need to somehow execute the JS code on those pages. So you either need a JS-capable solution for Scrapy, or you can just use Selenium, as many do, and as you already are.

Chillie
  • Yup, that window literally does nothing to affect your scraping. – Granitosaurus Sep 18 '18 at 08:26
  • When I try to access the product URLs on that page, I get an empty list. After view(response), I got that pop-up window, which hides everything else. – Muhammad Danial Sep 18 '18 at 08:52
  • I tried response.xpath('//*[@id="hits"]/a/@href').extract() in the spider and got nothing but an empty list. – Muhammad Danial Sep 18 '18 at 08:54
  • @Granitosaurus I appreciate your help, but that doesn't solve it for me. Is there anything in Scrapy, other than Selenium, that lets me click that button? – Muhammad Danial Sep 18 '18 at 09:01
  • @MuhammadDanial your issue is not related to age verification popup, the data is most likely loaded via javascript: https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax?rq=1 – Granitosaurus Sep 18 '18 at 09:13
  • @Granitosaurus But I have already visited them. All of them led me to Selenium. – Muhammad Danial Sep 18 '18 at 09:16
  • If you try to fetch the given link in the shell, you'll see my problem. That pop-up window doesn't allow me to extract the data, and that's why I'm getting nothing. – Muhammad Danial Sep 18 '18 at 09:18
  • I can see that Scrapy loaded all of the page's HTML, so why am I getting an empty list? – Muhammad Danial Sep 18 '18 at 09:19
  • @MuhammadDanial As we already pointed out, the pop-up is not preventing you from doing anything. Where are you getting this from? The pop-up is visual clutter, nothing more. It does not hide any data. See my answer for a more detailed explanation of where the data you want actually is. – Granitosaurus Sep 18 '18 at 09:23
  • @MuhammadDanial I've updated my answer. Sticking with Selenium might be your least time-consuming solution. – Chillie Sep 18 '18 at 09:40
  • I think you're right. Thanks for the edit. Maybe it'll help someone else. – Muhammad Danial Sep 18 '18 at 12:50

To expand on Chillie's answer.

The age verification is irrelevant here. The data you are looking for is loaded via an AJAX request.

See the related question "Can scrapy be used to scrape dynamic content from websites that are using AJAX?" to understand how such requests work.

You need to figure out how the URL https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1&x-algolia-application-id=NS5BWTAI8M&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c works and how you can retrieve it in Scrapy.
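As a starting point, here is a rough sketch of building the POST body for that endpoint. It assumes the endpoint accepts the usual Algolia multi-query JSON shape (a `requests` list whose entries carry an `indexName` and a URL-encoded `params` string). The index name `products_index`, the helper name `build_algolia_body`, and the parameter values are hypothetical. Copy the real ones from the request shown in the browser's Network tab.

```python
import json
from urllib.parse import urlencode

# Endpoint path taken from the URL above. The application id and API key
# travel in the query string and are omitted here for brevity.
ALGOLIA_URL = "https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries"


def build_algolia_body(index_name, query="", page=0, hits_per_page=20):
    """Build a JSON body for a single search in a multi-query request.

    The index name and parameter values are assumptions; take the real
    ones from the request the page itself sends.
    """
    params = urlencode({"query": query, "page": page, "hitsPerPage": hits_per_page})
    return json.dumps({"requests": [{"indexName": index_name, "params": params}]})


# "products_index" is a hypothetical placeholder, not the site's real index.
body = build_algolia_body("products_index")
print(body)
```

In a spider, a body like this could be sent with `scrapy.Request(url, method="POST", body=body, callback=...)`, with the response then parsed as JSON instead of HTML. Whether the site accepts such a request without further headers is something to verify by experiment.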

Granitosaurus
  • I think I should stick with Selenium for now. Thanks anyway. – Muhammad Danial Sep 18 '18 at 12:49
  • The AJAX request does seem to be awfully complicated, sticking with selenium might be an easier solution here :) Also worth looking into [Splash](https://github.com/scrapinghub/splash) which is an alternative to Selenium if you are looking for faster rendering. – Granitosaurus Sep 18 '18 at 13:48