R: scraping additional data after POST only works for first page

Question

I would like to scrape drug information published by the Swiss government for a university research project from:

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

The site does provide a robots.txt file; however, its content is freely available to the public, and I assume that scraping this data is not prohibited.

This is an update of this question, since I made some progress.

What I achieved so far

# opens the first results page 
# opens the first link as a table at the end of the page

library("rvest")
library("dplyr")


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

next: get the basic data

# makes a table of all results of the first page

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

next: get the additional data

# gives the desired information (= additional data) for the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text

My Problem:

# if I open the second search page

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

next: get the new basic data

# I easily get a table with the new results

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

But if I try to get the new additional data, I get the results from page 1 again:

# does not give the desired output:

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text

What I am looking for: the detailed data of the first drug of page 2

Questions:

Why do I get duplicate results? Is it because the __VIEWSTATE might change with the new request_POST?
Is there a way to solve this problem?
Is there a better way to get the basic and additional data? If so, how?

score 4 · Answer 1 · answered May 10 '19 at 00:59

I think you are simply overthinking the problem. The issue lies in the xpath: the xpath you are using for data extraction is the same for all pages, namely //*[@id="ctl00_cphContent_gvwPreparations"]. The only component that changes in your code is txtPageNumber. In the code below, I've changed it to 3 (txtPageNumber=3). I suggest you focus on a question like "How do I automate the page numbering for data extraction?". That way, you won't have to change txtPageNumber manually in

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                           ),
                           encode="form")
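As a minimal sketch of what automating the page number could look like, the request body can be wrapped in a function that takes the page number as an argument. The function name `page_body` and its arguments are hypothetical; the field names are copied from the code above, and `form` is assumed to have the same shape as `pgform`. I have not verified that the site accepts the same hidden fields for every page.

```r
# Build the POST body for one results page.
# Assumption: `form` looks like html_form(pgsession)[[1]], so hidden fields
# are reachable as form$fields$`__VIEWSTATE`$value and so on.
page_body <- function(form, page_number, page_size = "10") {
  list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = page_number,
    `__VIEWSTATE` = form$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = form$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = form$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = form$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = page_size,
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  )
}

# Hypothetical usage (needs network access, so it is left commented out):
# pages <- lapply(1:3, function(p) {
#   rvest:::request_POST(pgsession, url, body = page_body(pgform, p), encode = "form")
# })
```

Each element of `pages` could then be passed to `read_html()` and `html_table()` in the same way as a single page.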

The following code worked for me:

library(rvest)
library(dplyr)


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                           ),
                           encode="form")
# makes a table of all results of the requested page (page 3 here)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

# A tibble: 11 x 1
   .$``  $Präparat $`Galen. Form /~ $Packung $FAP  $PP   $SB   $`Lim-Pkt` $Lim 
   <chr> <chr>     <chr>            <chr>    <chr> <chr> <chr> <chr>      <chr>
 1 21.   Accolate  Tabl 20 mg       60 Stk   29.75 50.55 ""    ""         ""   
 2 22.   Accupaque Inj Lös 300 mg   Plast F~ 32.00 53.10 ""    ""         ""   
 3 23.   Accupaque Inj Lös 300 mg   Plast F~ 61.15 86.60 ""    ""         ""   
 4 24.   Accupaque Inj Lös 300 mg   Plast F~ 120.~ 154.~ ""    ""         ""   
 5 25.   Accupaque Inj Lös 350 mg   Plast F~ 33.97 55.35 ""    ""         ""   
 6 26.   Accupaque Inj Lös 350 mg   Plast F~ 66.88 93.20 ""    ""         ""   
 7 27.   Accupaque Inj Lös 350 mg   Plast F~ 129.~ 164.~ ""    ""         ""   
 8 28.   Accupro ~ Filmtabl 10 mg   30 Stk   8.56  18.00 ""    ""         ""   
 9 29.   Accupro ~ Filmtabl 10 mg   100 Stk  26.60 46.90 ""    ""         ""   
10 30.   Accupro ~ Filmtabl 20 mg   30 Stk   14.02 28.35 ""    ""         ""   
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
#   $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
#   Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>

# gives the desired information of the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_text %>%
  head(10)


[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n                        21.\r\n                    \r\n                        Accolate\r\n                    \r\n                        Tabl 20 mg \r\n                    \r\n                        60 Stk\r\n                    \r\n                        29.75\r\n                    \r\n                        50.55\r\n                    \r\n                                                \r\n                    \r\n                        \r\n                    \r\n                      \r\n                    \r\n                        53750036\r\n                    \r\n                        AstraZeneca AG\r\n                    \r\n                        Zafirlukastum\r\n                    \r\n                        17053\r\n                    \r\n                        15.03.1998\r\n                    \r\n                        \r\n                        \r\n                    \r\n                        \r\n                    \r\n                        03.04.50.\r\n                    \r\n                        R03DC01\r\n                    \r\n\t\t\t\t\r\n                        22.\r\n                    \r\n                        Accupaque\r\n                    \r\n

Thank you for your advice. In the second part, I am interested in the additional drug information that is displayed when one clicks on the Präparat name (I added a screenshot above). Therefore, the xpath is not the same. Your second output gives the same information as the tibble from your first output. Once I get this single step (= getting the table information + additional information) right, I can automate it by looping through the pages and the additional information (e.g. nr + 1 and gvw$Preparations$.. = gvw$Preparations$ctl(Nr in the List +1)$ct100) - captcoma, May 10 '19 at 08:26
@captcoma what baffles me is that you've completely changed the question structure after I answered it. My answer was for the **initial question** that was posted earlier. But now you've changed its structure to something else. In such a situation, it's advised to ask a new question. Anyway, I wish you luck in whatever you want to accomplish. - mnm, May 11 '19 at 01:53
This is already a new question derived from an older one. Since you answered it, I have only made minor changes to make it clearer. There was no change to the content. Anyone can see this by checking the revisions. Can you describe the 'complete' change? - captcoma, May 11 '19 at 07:54
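
For the two-step idea raised in the comments (first request a results page, then request the detail view of one row), a minimal sketch might look like the following. It is untested against the site and rests on two assumptions: that the hidden ASP.NET fields should be re-read from the most recent response before the second POST, and that the row controls are numbered ctl02, ctl03, ... as the `ctl02$ctl00` target for the first row suggests. The helper names `extract_state`, `row_event_target`, and `detail_body` are hypothetical.

```r
# Hidden ASP.NET fields that the original POST body copies from the form.
state_keys <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR",
                "__VIEWSTATEENCRYPTED", "__EVENTVALIDATION")

# Collect the current values of those fields from a form object shaped like
# html_form(...)[[1]], i.e. form$fields$`__VIEWSTATE`$value.
extract_state <- function(form) {
  state <- lapply(state_keys, function(key) form$fields[[key]]$value)
  names(state) <- state_keys
  state
}

# Build the __EVENTTARGET for the n-th row of the results table.
# Assumption: row 1 maps to ctl02, row 2 to ctl03, and so on.
row_event_target <- function(row) {
  sprintf("ctl00$cphContent$gvwPreparations$ctl%02d$ctl00", row + 1)
}

# Body for the second request: fresh hidden fields plus the row to open.
detail_body <- function(form, row) {
  c(extract_state(form),
    list(`__EVENTTARGET` = row_event_target(row), `__EVENTARGUMENT` = ""))
}

# Hypothetical usage (needs network access, so it is left commented out):
# form2  <- html_form(read_html(page))[[1]]   # `page` is the page-2 response
# detail <- rvest:::request_POST(pgsession, url,
#                                body = detail_body(form2, 1), encode = "form")
```

The server may also expect the page-number and page-size fields in the second body; if so, they can be appended to the list returned by `detail_body`.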
