
the code below works fine in interactive mode but fails when used in a function. it is simply two authentication POST commands followed by the data download. my goal is to get this working inside a function, not just in interactive mode.

this question is sort of a sequel to this question. icpsr recently updated their website. the minimal reproducible example below requires a free account, available at

https://www.icpsr.umich.edu/rpxlogin?path=ICPSR&request_uri=https%3a%2f%2fwww.icpsr.umich.edu%2ficpsrweb%2findex.jsp

i tried adding Sys.sleep(1) and various httr::GET/httr::POST calls but nothing worked.

my_download <-
    function( your_email , your_password ){

        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )


        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)

        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )

    }

# fails 
my_download( "email@address.com" , "some_password" )

# stepping through works
debug( my_download )
my_download( "email@address.com" , "some_password" )

EDIT: the failure simply downloads this page as if not logged in (instead of the dataset), so it's losing the authentication for some reason. if you are logged in to icpsr, use private browsing to see the page:

https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2?study=21600&ds=1&bundle=rdata&path=ICPSR

thanks!

Uddhav P. Gautam
Anthony Damico
  • so where / how exactly does it fail when used via the function? – RolandASc Feb 23 '18 at 17:13
  • @RolandASc sorry for not including that. see edit..thank you – Anthony Damico Feb 23 '18 at 17:36
  • https://www.icpsr.umich.edu/robots.txt suggests this activity is not authorized (and `robots.txt` is currently a bona fide technical control upheld in civil courts, at least in the U.S.). Unless one has written permission to automate access, it's not a good idea to pursue this. – hrbrmstr Feb 24 '18 at 15:47
  • I suggest ignoring @hrbrmstr's hand-wringing about robots.txt. At least it is not clear that a) your script qualifies as a "robot", or b) that respecting restrictions specified in robots.txt is necessarily a good idea. See https://en.wikipedia.org/wiki/Robots_exclusion_standard for relatively unbiased information on this issue. – Ista Mar 03 '18 at 00:41
  • For me running the function a second time works. So it's not about running it line-by-line, but rather whether it's been run before. In practical terms: just run it twice. – Ista Mar 03 '18 at 01:41
  • @Ista bizarro. yes, running the three `POST` and `GET` commands twice triggers the download within the function. happy to award the bounty if you want to make that an answer. thanks very much! – Anthony Damico Mar 03 '18 at 12:38
  • Nice job encouraging unethical and (depending on the jurisdiction) criminal actions, @Ista – hrbrmstr Mar 03 '18 at 13:02
  • Nice job trying to derail this question with irrelevant opinions @hrbrmstr. If you want to talk about legal issues please take it over to https://law.stackexchange.com/ – Ista Mar 03 '18 at 13:56
  • @AnthonyDamico I'm going to look into it a bit more to see if I can actually understand what is happening before writing up an answer. It will be a while before I have time to do that, hopefully someone else will beat me to it. – Ista Mar 03 '18 at 15:52

1 Answer


This sort of thing can happen because of the state (such as cookies) that the httr package stores in the handle for each URL (see ?handle).
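To make that state visible, here is a minimal sketch that creates one explicit handle with `httr::handle()` and passes it to every request through the `handle` argument, so each call to the function starts from a fresh cookie jar instead of whatever httr's internal pool holds from earlier runs. The function name `my_download_with_handle` is hypothetical, the URLs and form values are copied from the question, and whether this changes ICPSR's behaviour is untested. It only makes the cookie state explicit.

```r
# Sketch only: share one explicit httr handle across all requests.
# `my_download_with_handle` is a hypothetical name, not from the original post.
my_download_with_handle <- function(your_email, your_password) {

    # A fresh handle per call: cookies live on this object rather than in
    # httr's per-host handle pool, so earlier runs cannot influence this one.
    h <- httr::handle("https://www.icpsr.umich.edu")

    values <- list(
        agree = "yes", path = "ICPSR", study = "21600", ds = "",
        bundle = "rdata", dups = "yes",
        email = your_email, password = your_password
    )

    # Same request sequence as in the question, but every call reuses `h`.
    httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms",
               body = values, handle = h)
    httr::POST("https://www.icpsr.umich.edu/rpxlogin",
               body = values, handle = h)

    tf <- tempfile()
    httr::GET(
        "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2",
        httr::write_disk(tf, overwrite = TRUE),
        query = values,
        handle = h
    )

    # Return the path so the caller can inspect what was downloaded.
    tf
}
```

Calling `httr::cookies()` on the response of any of these requests shows which cookies the server has set at that point, which can help compare the interactive and in-function runs.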

In this particular case it remains unclear what exactly makes it work, but one strategy is to include a GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ prior to authenticating and requesting the data. For example,

my_download <-
    function( your_email , your_password ){
        ## for some reason this is required ...
        httr::GET("https://www.icpsr.umich.edu/cgi-bin/bob/")
        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)
        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )
    }

appears to work correctly, though it remains unclear what exactly the GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ does or why it is needed.
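Since the failure described in the question is silent (an HTML page is saved instead of the dataset), it may help to check the downloaded file before using it. The sketch below assumes the download is a zip archive, which is only inferred from the `zipcart2` endpoint name and is not confirmed here. The helper `looks_like_zip` is hypothetical and simply tests for the two-byte zip signature "PK".

```r
# Hypothetical helper: TRUE if the file starts with the zip signature "PK".
# Assumption: a successful ICPSR download is a zip archive.
looks_like_zip <- function(path) {
    sig <- readBin(path, what = "raw", n = 2L)
    length(sig) == 2L && identical(sig, as.raw(c(0x50, 0x4b)))
}

# Local demonstration with stand-in files, no network access needed.
zip_like <- tempfile()
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04)), zip_like)

html_like <- tempfile()
writeLines("<html><body>please log in</body></html>", html_like)

looks_like_zip(zip_like)
looks_like_zip(html_like)
```

Inside a download function, a check like `if (!looks_like_zip(tf)) stop("download does not look like a zip file")` would turn the silent failure into an error.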

Ista