I'm trying to scrape a website, but I can't reach the URL (a form action) that I want. I have already tried both Scrapy and Selenium, and both failed. If anyone has a tip or any clue about how I can reach this URL, I would be grateful.

Below is the code I used to try to scrape the URL (action) with Scrapy:

import scrapy
from scrapy.crawler import CrawlerProcess


class TestBMF(scrapy.Spider):
    name = 'test'
    base_url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=98307&CodigoTipoInstituicao=2'

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3",
        "Upgrade-Insecure-Requests": "1"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.base_url,
            headers=self.headers,
            callback=self.parse_detail
        )

    def parse_detail(self, response):
        # Collect the raw HTML of the iframe that should hold the form
        iframes = response.xpath('//iframe[contains(@id, "iFrameFormulariosFilho")]').getall()
        print(iframes)


process = CrawlerProcess()
process.crawl(TestBMF)
process.start()

This is the output:

2021-02-04 13:45:58 [scrapy.core.engine] INFO: Spider opened
2021-02-04 13:45:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-04 13:45:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-04 13:46:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=98307&CodigoTipoInstituicao=2> (referer: None)
['<iframe id="iFrameFormulariosFilho" style="height: 95%; width: 100%; overflow: scroll;" frameborder="0" title="Empresas - Formulário de Referência" height="80%"></iframe>']
2021-02-04 13:46:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-04 13:46:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats

As you can see, Scrapy returns the iframe, but the tag is empty.

But if I inspect the page in Google Chrome or Firefox, I find this:

[Image: Inspection]

I'm trying to reach the URL in the action attribute of the form inside the iframe: //iframe//form/@action

There are a few things I noticed and have already tried:

  • If I view the page source in the browser, it shows the same empty iframe that Scrapy and Selenium return.
  • If I inspect the page in Google Chrome or Firefox, the full HTML shows up.
  • I already tried Selenium with the same XPath, and it still returns the empty iframe.
  • If I use a simple request, the problem is the same.
  • If I ask Scrapy to print the full response body, I get the empty iframe.
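
The difference between the raw HTML and the inspected page can be checked offline. Below is a minimal sketch that uses only the standard library and the iframe tag copied from the Scrapy output above; the class name `IframeAttrs` is made up for illustration. The tag carries no `src` attribute in the raw HTML, which suggests (but does not prove) that JavaScript fills the iframe after the page loads.

```python
from html.parser import HTMLParser

# The iframe tag exactly as Scrapy printed it
RAW_IFRAME = (
    '<iframe id="iFrameFormulariosFilho" '
    'style="height: 95%; width: 100%; overflow: scroll;" '
    'frameborder="0" title="Empresas - Formulário de Referência" '
    'height="80%"></iframe>'
)


class IframeAttrs(HTMLParser):
    """Hypothetical helper: records the attributes of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes.append(dict(attrs))


parser = IframeAttrs()
parser.feed(RAW_IFRAME)
attrs = parser.iframes[0]

print("id:", attrs.get("id"))      # id: iFrameFormulariosFilho
print("has src:", "src" in attrs)  # has src: False
```

With no `src` and no child nodes in the raw response, there is nothing for `//iframe//form/@action` to match until a browser has run the page's scripts.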

I think that is everything. Sorry for any mistakes in my English; it isn't my native language.

And thanks to everyone for the help ;)

Felipe Cid

1 Answer

You need to use something like this: https://pypi.org/project/scrapy-headless-selenium/

You were on the right track in trying an actual browser: if the website relies on JavaScript in any significant way, you won't see the content unless that JavaScript runs.

That library has a tool for your use case:

def parse_result(self, response):
    # Click the element, then work with the reloaded response
    response = response.click('#id')  # equivalent to response.click('//*[@id="id"]')
    print(response.selector.xpath('//title/text()'))  # searches the reloaded response body

You can use it to click that button and trigger the POST you need.
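
If you would rather drive Selenium directly, here is a rough sketch of the same idea. It assumes the page's JavaScript eventually sets a `src` on the iframe, which I have not verified against this site. `wait_for_iframe_src` and `fetch_form_action` are hypothetical helper names, and the live part needs Selenium 4 and a local Chrome.

```python
import time

IFRAME_ID = "iFrameFormulariosFilho"  # id taken from the question's output


def wait_for_iframe_src(driver, iframe_id, timeout=10.0, poll=0.5):
    """Poll until the iframe has a non-empty src attribute, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the locator strategy string behind Selenium's By.ID
        src = driver.find_element("id", iframe_id).get_attribute("src")
        if src:
            return src
        if time.monotonic() >= deadline:
            raise TimeoutError(f"iframe {iframe_id!r} never received a src")
        time.sleep(poll)


def fetch_form_action(url):
    """Open the page in Chrome and read the action of the form in the iframe."""
    from selenium import webdriver  # third-party; imported lazily on purpose

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait_for_iframe_src(driver, IFRAME_ID)
        driver.switch_to.frame(IFRAME_ID)  # enter the iframe's document
        # "tag name" is the locator strategy string behind By.TAG_NAME
        form = driver.find_element("tag name", "form")
        return form.get_attribute("action")
    finally:
        driver.quit()
```

Calling `fetch_form_action(base_url)` with the URL from the question would then return the action, provided the assumption about `src` holds. If the form itself is rendered late, a further wait inside the iframe may be needed.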

Paul Collingwood