Web scraping with Scrapy: trying to reach a URL

Question

I'm trying to scrape a website, but I can't reach the URL (a form action) I want. I have already tried both Scrapy and Selenium, and both failed. If anyone has a tip or any clue about how I can reach this URL, I would appreciate it.

Below is the code I used to try to scrape the URL (action) with Scrapy:

import scrapy
from scrapy.crawler import CrawlerProcess


class TestBMF(scrapy.Spider):
    name = 'test'
    base_url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=98307&CodigoTipoInstituicao=2'

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3",
        "Upgrade-Insecure-Requests": "1"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.base_url,
            headers=self.headers,
            callback=self.parse_detail
        )

    def parse_detail(self, response):
        http_code = response.xpath('//iframe[contains(@id, "iFrameFormulariosFilho")]').getall()
        print(http_code)


process = CrawlerProcess()
process.crawl(TestBMF)
process.start()

Here is the output:

2021-02-04 13:45:58 [scrapy.core.engine] INFO: Spider opened
2021-02-04 13:45:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-04 13:45:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-04 13:46:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=98307&CodigoTipoInstituicao=2> (referer: None)
['<iframe id="iFrameFormulariosFilho" style="height: 95%; width: 100%; overflow: scroll;" frameborder="0" title="Empresas - Formulário de Referência" height="80%"></iframe>']
2021-02-04 13:46:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-04 13:46:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats

As you can see, Scrapy returns the iframe, but the tag is empty.

But if I inspect the page in Google Chrome or Firefox, I find this:

I'm trying to reach the action URL inside the iframe, at //iframe//form/@action

There are a few things I noticed and have already tried:

If I view the page source in the browser, it shows the same empty iframe that Scrapy and Selenium return.
If I inspect the page in Google Chrome or Firefox, the full HTML shows up.
I already tried using Selenium with the same XPath, and it still returns the empty iframe.
If I use a simple request, the problem is the same.
If I ask Scrapy to show the full body of the URL, I get the empty iframe.

I think that's everything. Sorry for any mistakes in my English; it isn't my native language.

Thanks to everyone for the help ;)

score 0 · Answer 1 · answered Feb 04 '21 at 17:36


You need to use something like this: https://pypi.org/project/scrapy-headless-selenium/

You were on the right track in using an actual browser. If the website relies on JavaScript in any significant way, you won't see the content unless that JavaScript runs.

That library has a tool for your use case:

def parse_result(self, response):
    # CSS selector; the XPath equivalent would be '//*[@id="id"]'
    response = response.click('#id')
    # searches the reloaded response body
    print(response.selector.xpath('//title/text()'))

You can use it to click the button that triggers the POST request you need.
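The point about JavaScript can be checked against the question's own output: the iframe tag Scrapy printed has no src attribute, so the server-sent HTML never says what the frame should load. The sketch below uses only Python's standard library to parse that tag and report whether src is present. The names IframeCollector and iframe_attributes are illustrative and are not part of Scrapy or the library linked above.

```python
from html.parser import HTMLParser

# The iframe tag exactly as Scrapy printed it in the question's output.
STATIC_IFRAME = (
    '<iframe id="iFrameFormulariosFilho" '
    'style="height: 95%; width: 100%; overflow: scroll;" frameborder="0" '
    'title="Empresas - Formulário de Referência" height="80%"></iframe>'
)


class IframeCollector(HTMLParser):
    """Collects the attributes of every <iframe> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes.append(dict(attrs))


def iframe_attributes(html):
    """Return one attribute dict per <iframe> in the given HTML string."""
    collector = IframeCollector()
    collector.feed(html)
    collector.close()
    return collector.iframes


frames = iframe_attributes(STATIC_IFRAME)
for attrs in frames:
    # No "src" means the static HTML gives the frame nothing to load;
    # a script has to point it somewhere after the page has loaded.
    print(attrs.get("id"), "has src:", "src" in attrs)
```

This prints `iFrameFormulariosFilho has src: False`, which is consistent with the frame being filled in by a script after load, something a plain HTTP fetch never executes.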

answered Feb 04 '21 at 17:36

Paul Collingwood


Hey Paul, thanks for the reply, but the base_url page doesn't have any button to click to make a request. – Felipe Cid Feb 04 '21 at 18:21
https://stackoverflow.com/questions/30342243/send-post-request-in-scrapy – Paul Collingwood Feb 04 '21 at 19:25
When you GET that url you are not making a POST request. So you have to change your approach – Paul Collingwood Feb 04 '21 at 19:26
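
A minimal sketch of the GET versus POST distinction made in the comment above, using only Python's standard library. The endpoint and the form fields are placeholders, since the thread never identifies the real target or payload, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and form fields, for illustration only.
HYPOTHETICAL_URL = "https://example.com/form-handler"
HYPOTHETICAL_FIELDS = {"field": "value"}

# Without a body, urllib treats the request as a GET.
get_request = Request(HYPOTHETICAL_URL)

# With a URL-encoded body, the same URL becomes a POST.
post_body = urlencode(HYPOTHETICAL_FIELDS).encode("ascii")
post_request = Request(HYPOTHETICAL_URL, data=post_body)

# The requests are only built here, never opened.
print(get_request.get_method())
print(post_request.get_method())
```

In a spider, the usual counterpart would be scrapy.FormRequest with its formdata argument, which is what the linked question covers.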
