I'm testing some web-scraping scripts in R. I've read many tutorials and docs and tried different approaches, but have had no success so far.
The URL I'm trying to scrape is this one. It has public, government data, and no statements against web scrapers. It's in Portuguese, but I believe it won't be a big problem.
It shows a search form with several fields. As a test, I searched for data from a particular state ("RJ", in the field "UF") and city ("Rio de Janeiro", in the field "MUNICIPIO"). Clicking "Pesquisar" (Search) shows the following output:
Using Firebug, I found that the URL it requests (with the parameters above) is:
http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10
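To make that request easier to read, here is a minimal base-R sketch that splits the query string into decoded name/value pairs. The variable names (`query`, `fields`, `params`) are only illustrative, and the string is the one captured above with the emphasis markers removed.

```r
# Query string copied from the request captured with Firebug.
query <- paste0(
  "buscaForm=buscaForm",
  "&codEntidadeDecorate%3AcodEntidadeInput=",
  "&noEntidadeDecorate%3AnoEntidadeInput=",
  "&descEnderecoDecorate%3AdescEnderecoInput=",
  "&estadoDecorate%3AestadoSelect=33",
  "&municipioDecorate%3AmunicipioSelect=3304557",
  "&bairroDecorate%3AbairroInput=",
  "&pesquisar.x=42&pesquisar.y=16",
  "&javax.faces.ViewState=j_id10"
)

# Split into "name=value" fields, then decode each side separately.
fields <- strsplit(query, "&", fixed = TRUE)[[1]]
keys <- vapply(fields, function(f) URLdecode(sub("=.*$", "", f)),
               character(1), USE.NAMES = FALSE)
values <- vapply(fields, function(f) URLdecode(sub("^[^=]*=", "", f)),
                 character(1), USE.NAMES = FALSE)
params <- setNames(values, keys)

print(params)
```

Besides the state and city codes (33 and 3304557), the request also carries `javax.faces.ViewState`, the JSF view-state field. My assumption is that this token is tied to the server-side session, so it may matter for the problem described below.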
The site uses a jsessionid, as can be seen using the following:
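library(rvest)
library(httr)
# Request the landing page; the server sets the session cookie in its response
url <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/")
# List the cookies returned by the server (includes the jsessionid)
cookies(url)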
I took the jsessionid value reported by cookies(url) and inserted it into a new URL, like this:
url <- read_html("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=008142964577DBEC622E6D0C8AF2F034?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=33108064&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=org.jboss.seam.ui.NoSelectionConverter.noSelectionValue&bairroDecorate%3AbairroInput=&pesquisar.x=65&pesquisar.y=8&javax.faces.ViewState=j_id2")
html_text(url)
However, the output doesn't contain the data. Instead, it contains an error message which, translated into English, says that the session has expired.
I assume this is a basic mistake, but I have searched extensively and couldn't find a way to overcome it.
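One possible direction, not verified against the live site: the expired-session message may arise because read_html() opens its own connection, which does not carry the cookie obtained through GET(), and because the hard-coded javax.faces.ViewState value may no longer match the session. The sketch below assumes that the landing page contains a hidden input named javax.faces.ViewState and that the search accepts the same parameters as the captured URL. The helper names `extract_view_state` and `search_schools` are hypothetical.

```r
library(httr)
library(rvest)

base_url <- "http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"

# Pull the JSF view-state token out of a page's HTML text.
# Returns NA if the page has no such hidden input (an assumption about the markup).
extract_view_state <- function(html_text) {
  page <- read_html(html_text)
  field <- html_element(page, "input[name='javax.faces.ViewState']")
  html_attr(field, "value")
}

# Fetch the form first, then run the search with the fresh token.
# httr keeps cookies across requests to the same host within an R session,
# so the second GET() should send the jsessionid set by the first one.
search_schools <- function(state_code, city_code) {
  home <- GET(base_url)
  stop_for_status(home)
  view_state <- extract_view_state(content(home, as = "text"))

  result <- GET(base_url, query = list(
    buscaForm = "buscaForm",
    `codEntidadeDecorate:codEntidadeInput` = "",
    `noEntidadeDecorate:noEntidadeInput` = "",
    `descEnderecoDecorate:descEnderecoInput` = "",
    `estadoDecorate:estadoSelect` = state_code,
    `municipioDecorate:municipioSelect` = city_code,
    `bairroDecorate:bairroInput` = "",
    pesquisar.x = "42",  # click coordinates copied from the captured request
    pesquisar.y = "16",
    `javax.faces.ViewState` = view_state
  ))
  stop_for_status(result)
  read_html(content(result, as = "text"))
}
```

With this in place, `html_text(search_schools("33", "3304557"))` would repeat the RJ / Rio de Janeiro search. Whether the server accepts it still depends on the assumptions above.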