I'm trying to get the href from these table contents, but it is not available in the HTML code. [edited @ 3:44 pm 10/02/2019] I will scrape this site and others similar to it on a daily basis and compare the result with "yesterday's" data, so that I get each day's new information. [/edited]

I found a similar (but simpler) solution, but it uses chromedriver (link). I'm looking for a solution that doesn't use Selenium.

Site: http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D

If you click on the first part of the table (as below): [image]

You will get to this site: http://web.cvm.gov.br/app/esforcosrestritos/#/enviarFormularioEncerramento?type=dmlldw%3D%3D&ofertaId=ODc2MA%3D%3D&state=eyJhbm8iOiJNakF4T1E9PSIsInZhbG9yIjoiTVRFPSIsImNvbXVuaWNhZG8iOiJNUT09Iiwic2l0dWFjYW8iOiJNZz09In0%3D

How can I scrape the first site to get all the links it has in the tables (so that I can go on to the second set of pages)?

When I use requests.get it doesn't even get the content of the table. Any help?

import requests

link_cvm = "http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D"
html_code = requests.get(link_cvm)
# Print the returned HTML (not the Response object); the table contents are not in it
print(html_code.text)
Felipe Ribeiro
  • Is this a one-time thing? I only ask because you can easily download all the raw data manually from the DevTools "Network" tab. – Ayman Safadi Oct 02 '19 at 17:44
  • Hi @Ayman, no. I will scrap this site and others similar to this one, on a daily basis and compare with the "yesterday" data. So I get the daily new info in this data. – Felipe Ribeiro Oct 02 '19 at 18:44
  • FYI it’s __scrape__ (and __scraping__, __scraper__, __scraped__) not scrap. ‘To scrap’ means to throw away like rubbish :-( – DisappointedByUnaccountableMod May 18 '21 at 18:28

1 Answer

The second page you are taken to is loaded dynamically using JavaScript. The data you are looking for is served from a separate URL, in JSON format. There is a lot of information available about this; for one example of many, see this.
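As a small illustration of why a plain request to the page URL comes back without the table: everything after the `#` is a URL fragment, which is interpreted in the browser and is not part of what the server is asked for. The sketch below only splits the URL from the question with the standard library; it makes no network call, and the variable names are chosen here for illustration.

```python
from urllib.parse import urlsplit

# Page URL from the question
page_url = (
    "http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta"
    "?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D"
)

parts = urlsplit(page_url)

# The part an HTTP client actually asks the server for (fragment excluded)
requested = f"{parts.scheme}://{parts.netloc}{parts.path}"
print(requested)

# The part that stays on the client side for the page's JavaScript to interpret
print(parts.fragment)
```

So the server only ever sees the bare application path, and the table data has to come from whatever URL the page's JavaScript calls afterwards.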

In your case, you can get to it this way:

import requests
import json

url = 'http://web.cvm.gov.br/app/esforcosrestritos/enviarFormularioEncerramento/getOfertaPorId/8760'
resp = requests.get(url)

data = json.loads(resp.content)
print(data)

The output is the information on that page.
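The `8760` in that URL appears to be the `ofertaId` parameter of the second page's address, Base64-decoded. A minimal sketch of that mapping follows. It assumes (based only on this one example) that `ofertaId` is always a Base64-encoded id and that the endpoint pattern above holds for other offers; `oferta_id_from_page_url` and `api_url_for` are hypothetical helper names, not part of the site or of any library.

```python
import base64
from urllib.parse import parse_qs, urlsplit

# Endpoint prefix taken from the URL used in the answer above
API_BASE = (
    "http://web.cvm.gov.br/app/esforcosrestritos/"
    "enviarFormularioEncerramento/getOfertaPorId/"
)

# Second page URL from the question
second_page_url = (
    "http://web.cvm.gov.br/app/esforcosrestritos/#/enviarFormularioEncerramento"
    "?type=dmlldw%3D%3D&ofertaId=ODc2MA%3D%3D"
    "&state=eyJhbm8iOiJNakF4T1E9PSIsInZhbG9yIjoiTVRFPSIsImNvbXVuaWNhZG8iOiJNUT09"
    "Iiwic2l0dWFjYW8iOiJNZz09In0%3D"
)


def oferta_id_from_page_url(page_url: str) -> str:
    """Hypothetical helper: decode the Base64 ofertaId held in the URL fragment."""
    fragment = urlsplit(page_url).fragment
    # The fragment looks like "/route?key=value&..."; keep only the query part
    _, _, query = fragment.partition("?")
    # parse_qs also undoes the percent-encoding (%3D -> "=")
    encoded = parse_qs(query)["ofertaId"][0]
    return base64.b64decode(encoded).decode("ascii")


def api_url_for(page_url: str) -> str:
    """Hypothetical helper: build the JSON endpoint URL (assumed pattern)."""
    return API_BASE + oferta_id_from_page_url(page_url)


api_url = api_url_for(second_page_url)
print(api_url)
```

The resulting URL can then be passed to `requests.get` as in the code above. This does not by itself list the ids for every row of the first table; those would still have to come from whichever request populates that table.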

Jack Fleeting
  • Tks @JackFleeting, what I need is to get the links to these second pages. When I'm on the second page I can get the data. Any ideas? – Felipe Ribeiro Oct 02 '19 at 19:45
  • @FelipeRibeiro - Click on the link in my answer and read up on using the developer tab in the browser to track down dynamically loaded data. – Jack Fleeting Oct 02 '19 at 19:47
  • Tks friend. From what I saw, the example only gets the data, not the "href" to the other pages; at least that is what I understood from the example you sent. Did I get it right? Tks a lot for your time and patience, best. – Felipe Ribeiro Oct 02 '19 at 19:59
  • @FelipeRibeiro - It's NOT a simple process, so you have a lot to learn... Try this too: https://ianlondon.github.io/blog/web-scraping-discovering-hidden-apis/. Also, don't forget to accept the answer. – Jack Fleeting Oct 02 '19 at 20:16
  • Tks @Jack. I'm looking forward to doing it. Very complex indeed. If I find other solutions I'll post them here. – Felipe Ribeiro Oct 03 '19 at 17:11
  • Hi @Jack, I posted another question here: https://stackoverflow.com/questions/58341926/how-to-get-the-last-table-from-this-site-python . I solved the problem from this question in a more "manual" way, but now I need help getting the last table's info (it is not in JSON, as far as I could see). Tks in advance. – Felipe Ribeiro Oct 11 '19 at 13:08