I'm trying to scrape a website, but I can't reach the URL (a form action) that I want. I have already tried both Scrapy and Selenium, and both failed. If anyone has a tip or any clue about how I can reach this URL, I would appreciate it.
Below is the code I used to try to scrape the URL (action) with Scrapy:
import scrapy
from scrapy.crawler import CrawlerProcess

class TestBMF(scrapy.Spider):
    name = 'test'
    base_url = 'https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=98307&CodigoTipoInstituicao=2'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3",
        "Upgrade-Insecure-Requests": "1"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.base_url,
            headers=self.headers,
            callback=self.parse_detail
        )

    def parse_detail(self, response):
        http_code = response.xpath('//iframe[contains(@id, "iFrameFormulariosFilho")]').getall()
        print(http_code)

process = CrawlerProcess()
process.crawl(TestBMF)
process.start()
This is the output:
2021-02-04 13:45:58 [scrapy.core.engine] INFO: Spider opened
2021-02-04 13:45:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-04 13:45:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-04 13:46:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rad.cvm.gov.br/ENETCONSULTA/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=98307&CodigoTipoInstituicao=2> (referer: None)
['<iframe id="iFrameFormulariosFilho" style="height: 95%; width: 100%; overflow: scroll;" frameborder="0" title="Empresas - Formulário de Referência" height="80%"></iframe>']
2021-02-04 13:46:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-04 13:46:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats
As you can see, Scrapy returns the iframe, but the tag is empty.
But if I inspect the page in Google Chrome or Firefox, I find this:
I'm trying to reach the URL in the action attribute of the form inside the iframe, i.e. //iframe//form/@action
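One detail that may help narrow this down: the iframe tag that Scrapy printed has no src attribute. My guess (an assumption, the log alone does not prove it) is that the browser fills the iframe with JavaScript after the page loads, which would explain why the raw HTML has nothing inside it. A minimal sketch, using only the standard library, that lists the attributes of that tag. IframeAttrs is a hypothetical helper name, not part of Scrapy:

```python
from html.parser import HTMLParser


class IframeAttrs(HTMLParser):
    """Collect the attributes of every <iframe> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "iframe":
            self.iframes.append(dict(attrs))


# The iframe tag exactly as it appears in the Scrapy output.
IFRAME_HTML = (
    '<iframe id="iFrameFormulariosFilho" '
    'style="height: 95%; width: 100%; overflow: scroll;" '
    'frameborder="0" title="Empresas - Formulário de Referência" '
    'height="80%"></iframe>'
)

parser = IframeAttrs()
parser.feed(IFRAME_HTML)
attrs = parser.iframes[0]

print(sorted(attrs))   # attribute names present on the tag
print("src" in attrs)  # whether the static HTML points the iframe anywhere
```

If the check confirms there is no src, then an XPath like //iframe//form/@action has nothing to match in the raw response, because the form lives in a separate document that is requested later.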
There are a few things I noticed and have already tried:
- If I view the page source in the browser, it shows the same empty iframe that Scrapy and Selenium return.
- If I inspect the page in Google Chrome or Firefox, the full HTML code shows up.
- I already tried Selenium with the same XPath, and it still returns the empty iframe.
- If I use a simple request, the problem is the same.
- If I ask Scrapy to show the full body of the response, I get the empty iframe as well.
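Since the page source and the inspected page differ, the iframe URL is presumably assigned by a script on the page. A rough sketch of how the raw response body could be searched for inline script lines that mention the iframe id. The sample HTML and the frmExemplo.aspx URL are made up for illustration (the real script will look different), and script_lines_mentioning is a hypothetical helper, not part of Scrapy or of the site:

```python
import re


def script_lines_mentioning(html, needle):
    """Return the stripped lines of inline <script> blocks that contain `needle`."""
    scripts = re.findall(
        r"<script\b[^>]*>(.*?)</script>",
        html,
        flags=re.DOTALL | re.IGNORECASE,
    )
    hits = []
    for body in scripts:
        for line in body.splitlines():
            if needle in line:
                hits.append(line.strip())
    return hits


# Made-up sample page; only the iframe id comes from the real site.
SAMPLE_HTML = """
<html><body>
<iframe id="iFrameFormulariosFilho"></iframe>
<script type="text/javascript">
  var x = 1;
  document.getElementById("iFrameFormulariosFilho").src = "frmExemplo.aspx?id=1";
</script>
</body></html>
"""

hits = script_lines_mentioning(SAMPLE_HTML, "iFrameFormulariosFilho")
for line in hits:
    print(line)
```

In the spider, response.text could be passed in place of the sample. If a relative URL turns up, response.urljoin(...) would resolve it against the page URL, and a second scrapy.Request to that address could then be parsed for //form/@action.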
I think that's everything. Sorry for any mistakes in my English, it isn't my native language.
Thanks to everyone for the help ;)