
I am trying to scrape the following website: https://www.climatempo.com.br/climatologia/558/saopaulo-sp. It has two drop-down menus, the second depending on the first, so I chose to use Scrapy and Splash via scrapy-splash.

I need to automate the change of location by first selecting the state, then the city. I tried SplashFormRequest, but I have not been able to change the list of cities. My spider is below (the prints are for debugging):

import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class ExampleSpider(scrapy.Spider):
    name = 'climatologia'

    def start_requests(self):
        urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},)

    def parse(self, response):
        print(response.url)
        state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
        print(state)

        return SplashFormRequest(response.url, method='POST',
                                 formdata={'sel-state-geo': 'SP'},
                                 callback=self.state_selected,
                                 args={'wait': 0.5})

    def state_selected(self, response):
        print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
        print(response.css("select.slt-geo")[0].css("option::text").extract())
        print(response.css("select.slt-geo")[1].css("option::text").extract())
Gallaecio
Daniel Lima

1 Answer


If you absolutely must drive the site's menus, I would suggest Selenium for this job. The only way to script interactions in Splash is with a Lua script, which you send to the execute endpoint. I found the options you were trying to select, but not where the form is submitted or how it works on the site. (I had to translate the page to English.)
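If you do go the Splash route, a minimal sketch of such a Lua script might look like the following. It assumes the state drop-down is the first element matching select.slt-geo (the selector from the question) and that the page reloads the city list on a change event; neither is confirmed here. The helper name splash_execute_args is made up for illustration.

```python
# Lua script for Splash's `execute` endpoint.
# Assumption: setting the <select> value and firing a "change" event is
# enough for the page to refresh the city list.
LUA_SELECT_STATE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))

  -- Wrap a JavaScript function so it can be called from Lua.
  local select_state = splash:jsfunc([[
    function (selector, value) {
      var el = document.querySelector(selector);
      if (!el) { return false; }
      el.value = value;
      var evt = document.createEvent('HTMLEvents');
      evt.initEvent('change', true, false);
      el.dispatchEvent(evt);
      return true;
    }
  ]])

  assert(select_state(args.state_selector, args.state), 'state select not found')
  -- Give the page time to load the dependent city list.
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def splash_execute_args(state, selector="select.slt-geo", wait=0.5):
    """Build the `args` dict for a SplashRequest with endpoint='execute'.

    `selector` is an assumption taken from the question's CSS query.
    """
    return {
        "lua_source": LUA_SELECT_STATE,
        "state": state,
        "state_selector": selector,
        "wait": wait,
    }
```

In the spider, parse could then yield something like SplashRequest(response.url, self.state_selected, endpoint='execute', args=splash_execute_args('SP'), dont_filter=True), where dont_filter is there because the same URL was already requested once.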

My suggestion is to look in the browser inspector for endpoints that the page calls. This is one of several that look particularly interesting: https://www.climatempo.com.br/json/busca-estados

This endpoint returns JSON like the following:

{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}]}
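To illustrate how a payload of this shape could be consumed, here is a small sketch that maps each state abbreviation (uf) to its name. The function name states_by_uf is made up, and the sample string is just two rows trimmed from the response above.

```python
import json


def states_by_uf(payload):
    """Return a {uf: state name} dict from a busca-estados style response."""
    doc = json.loads(payload)
    if not doc.get("success"):
        # The endpoint reports failures in-band via "success": false.
        raise ValueError(doc.get("message") or "request was not successful")
    return {row["uf"]: row["state"] for row in doc["data"]}


# Two rows trimmed from the response shown above (raw string keeps the
# \u escapes intact so the JSON parser decodes them).
SAMPLE = r'''{"success":true,"message":"Resultados encontrados","data":[
{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},
{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null}
]}'''

if __name__ == "__main__":
    print(states_by_uf(SAMPLE))
```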

Hopefully this is another way to get the data you are looking for.

You can then use normal requests to get the data; you just have to form each request the same way the browser does. Usually adding Accept, User-Agent, and X-Requested-With headers is enough to pass.
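As a rough sketch of what "forming the request the same" could mean, the snippet below builds a request with those three headers using only the standard library. The header values are placeholders rather than values captured from the site; copy the real ones from the browser inspector.

```python
from urllib.request import Request

STATES_URL = "https://www.climatempo.com.br/json/busca-estados"

# Placeholder values (assumptions): replace with what your browser sends.
BROWSER_HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "X-Requested-With": "XMLHttpRequest",
}


def build_states_request(url=STATES_URL):
    """Build (but do not send) a GET request that imitates the browser's XHR."""
    return Request(url, headers=BROWSER_HEADERS)


# To actually send it:
#   from urllib.request import urlopen
#   with urlopen(build_states_request()) as resp:
#       body = resp.read().decode("utf-8")
```

In a Scrapy spider the same dict could be passed as scrapy.Request(url, headers=BROWSER_HEADERS, callback=...).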

eusid
  • Thanks for the reply. I also tried to use their API with a POST request to https://www.climatempo.com.br/json/busca-cidades-uf but didn't succeed. The parameter I should send is uf=SP (or any other state abbreviation) but things like https://www.climatempo.com.br/json/busca-cidades-uf?uf=SP give me ''success:false''. – Daniel Lima Nov 30 '17 at 19:59
  • Try having the scraper first visit the page you would make that POST request from, so Scrapy has the right session data. If there really is no other way, you must write a Lua script or use Selenium. I was able to use Firefox to resend the requests and got a success, although I didn't change any parameters. The only thing the server could be doing is checking the cookie, and if you can replicate it, it should work. Even if you could figure it out, though, using Selenium is still easier than spending the time trying. – eusid Nov 30 '17 at 20:31
  • I know it's kind of late, but maybe accept my answer? :) – eusid May 12 '19 at 09:58
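The session idea from the comments can be sketched with the standard library as follows: load the page first so any cookies are stored, then send the POST with the same cookie jar. That cookies are what the server checks is only a guess from the comments, and the Referer and X-Requested-With headers are assumptions too. In Scrapy the cookie handling happens automatically when the POST is yielded from the page's callback.

```python
import json
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

PAGE_URL = "https://www.climatempo.com.br/climatologia/558/saopaulo-sp"
CITIES_URL = "https://www.climatempo.com.br/json/busca-cidades-uf"


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def build_cities_request(uf):
    """Build the POST with form field uf=<state abbreviation>."""
    body = urlencode({"uf": uf}).encode("ascii")
    headers = {
        "Referer": PAGE_URL,                   # assumption
        "X-Requested-With": "XMLHttpRequest",  # assumption
    }
    return Request(CITIES_URL, data=body, headers=headers, method="POST")


def fetch_cities(uf):
    """Visit the page to collect cookies, then POST with the same session."""
    opener, _jar = make_session()
    opener.open(PAGE_URL).close()
    with opener.open(build_cities_request(uf)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_cities("SP") performs both requests; whether the second one returns success depends on what the server actually validates.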