Harvesting data from webpage in R - accessing multiple pages

Question

I am following up on my question from yesterday, "harvesting data via drop down list in R".

First, I need to obtain all 50k strings with the details of all doctors from this page: http://www.lkcr.cz/seznam-lekaru-426.html#seznam I know how to obtain them from a single page:

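library(httr)

oborID  <- "48"
okresID <- "3702"
web     <- "http://www.lkcr.cz/seznam-lekaru-426.html"

# POST the search form for one specialty/district pair and return the raw HTML
extractHTML <- function(oborID, okresID) {
  query <- list('filterObor' = oborID, 'filterOkresId' = okresID, 'do[findLekar]' = 1)
  response <- POST(url = web, body = query)
  content(response, "text")
}

# Cut out the strings that sit between each "filterId" and the following "DETAIL"
IDfromHTML <- function(html) {
  starting <- unlist(gregexpr("filterId", html))
  ending   <- unlist(gregexpr("DETAIL", html))

  # gregexpr returns -1 when there is no match; test only the first element,
  # because && and || need length-one conditions
  if (starting[1] == -1 || ending[1] == -1) {
    return(NULL)
  }

  # keep every second "filterId" match (safe even when there is only one match)
  starting <- starting[c(FALSE, TRUE)]

  strings <- character(0)
  for (i in seq_along(starting)) {
    strings[i] <- substr(html, starting[i] + 9, ending[i] - 18)
  }
  list(strings)
}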
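For comparison, the position arithmetic in IDfromHTML could be replaced by a single regular-expression pass. This is only a sketch: it assumes the IDs are purely numeric and always appear as "filterId=<digits>", which the question does not confirm, and the sample_html fragment below is made up for illustration.

```r
# Hypothetical fragment; the markup of the real page may differ.
sample_html <- paste0(
  '<a href="?filterId=111&do[findLekar]=1">x</a>',
  '<a href="?filterId=111#seznam">DETAIL</a>',
  '<a href="?filterId=222&do[findLekar]=1">x</a>',
  '<a href="?filterId=222#seznam">DETAIL</a>'
)

ids_from_html <- function(html) {
  # Collect every "filterId=<digits>" occurrence in one pass
  # (assumption: IDs consist of digits only)
  hits <- regmatches(html, gregexpr("filterId=[0-9]+", html))[[1]]
  # Strip the key, then drop the repeated occurrences of each ID
  unique(sub("filterId=", "", hits, fixed = TRUE))
}

ids_from_html(sample_html)
```

Because unique() removes the duplicates, this variant does not need to know that each ID shows up twice, and it returns an empty character vector when nothing matches.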

Still, I am aware that downloading the whole page for only a few lines of text is quite inefficient (but it works! :) Could you give me a tip on how to make this process more efficient?

I have also encountered some pages with more than 20 doctors listed (e.g. the combination of "Brno-město" and "chirurgie"). Such data are split across several pages, which are accessed via a list of hyperlinks at the end of the form. I need to access each of these pages and run the code I presented here on them, but I guess I have to pass some cookies along.
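A rough sketch of how the paging links might be followed is given here. Everything about the link format is an assumption: the marker "paging" is a placeholder for whatever parameter the real links carry, and the links are assumed to be relative to a base URL. On the cookie question, httr reuses one handle per host, so cookies set by an earlier request should be sent again on later requests to the same host without extra code.

```r
# Pull href values out of the HTML and keep those containing `marker`.
# "paging" is a placeholder, not the real parameter name of the site.
page_links <- function(html, marker = "paging") {
  hrefs <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  hrefs <- sub('^href="', "", sub('"$', "", hrefs))
  unique(hrefs[grepl(marker, hrefs, fixed = TRUE)])
}

# Return the first page plus every linked page as a character vector.
# The default fetcher needs the httr package; it is only evaluated
# when no other `fetch` function is supplied.
fetch_pages <- function(first_html, base_url,
                        fetch = function(u) {
                          httr::content(httr::GET(u), "text", encoding = "UTF-8")
                        }) {
  links <- page_links(first_html)
  # Assumption: the links are relative to base_url
  more <- vapply(links, function(l) fetch(paste0(base_url, l)),
                 character(1), USE.NAMES = FALSE)
  c(first_html, more)
}
```

Passing the fetcher as an argument keeps the link extraction testable offline; each returned page can then be handed to the ID extraction function from the question.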

Other than that, the combination of "Praha" and "chirurgie" is problematic as well: there are more than 200 records, so the page applies some script, and I then need to click the "další" ("next") button and use the same method as in the previous paragraph.
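If the "další" button turns out to correspond to an ordinary request, the pages could be collected with a follow-until-exhausted loop like the one sketched here. This is an assumption: the question says a script is involved, so the button may not map to a plain URL at all. The helpers next_link and fetch are hypothetical and would have to be written for the real page.

```r
# Keep requesting the "next" page until there is none left.
# next_link(html) should return the URL of the following page,
# or NULL / NA / character(0) on the last page.
# fetch(url) should return the HTML of that page.
collect_until_last <- function(first_html, next_link, fetch, max_pages = 100) {
  pages <- list(first_html)
  html <- first_html
  while (length(pages) < max_pages) {   # hard cap guards against endless loops
    url <- next_link(html)
    if (length(url) == 0 || is.na(url[1])) break
    html <- fetch(url)
    pages[[length(pages) + 1]] <- html
  }
  pages
}

# Offline illustration with a made-up three-page "site"
site <- c(p1 = "p2", p2 = "p3", p3 = NA)
pages <- collect_until_last("p1",
                            next_link = function(h) site[[h]],
                            fetch = function(u) u)
```

The max_pages cap is a safety net in case the last page still exposes a "next" link.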

Can you help me please?

Why don't you ask the site whether they would give you the file that contains all of this data? Or somehow find out which file the search is run against! That would save you some time, I guess. — Ansjovis86, Oct 27 '16 at 11:30
I asked the Czech Medical Chamber for the data; they refused, saying that they do not provide these data under any circumstances. The woman on the phone was quite strict about this rule. — johnnyheineken, Oct 27 '16 at 11:34
Well it is strange that they don't want to give you the data as it can be parsed (with some work) from the site. I guess if you make a query the site returns values from some meta-file containing all data. If you can figure out its location (if permissions allow it), you're done as well. Firefox has a LiveHTTP headers package you can use for printing the actions on a site. Maybe by using that, you can find the location of the file. — Ansjovis86, Oct 27 '16 at 12:21
Yes, it is kinda strange, but I believe that this institution comes from the times of deep communism. Such practice is quite normal here. — johnnyheineken, Oct 28 '16 at 09:17
I am really not sure where to look for the location of the file. I used Chrome dev tools, but there is too much information to look through, and I don't understand it :/ — johnnyheineken, Oct 28 '16 at 09:20

