0

SourceForge Research Data Archive (SRDA) is one of the data sources for my dissertation research. I'm having difficulty in debugging the following issue related to SRDA data collection.

Data collection from SRDA requires authentication and then submitting Web form with an SQL query. Upon successful processing of the query, the system generates a text file with query results. While testing my R code for SRDA data collection, I've changed the SQL request to make sure that the results file is being regenerated. However, I've discovered that the file contents stays the same (corresponds to previous query). I think that the lack of refresh of the file contents could be due to failure of either authentication, or query form submission. The following is the debug output from the code (https://github.com/abnova/diss-floss/blob/master/import/getSourceForgeData.R):

make importSourceForge

Rscript --no-save --no-restore --verbose getSourceForgeData.R
running
  '/usr/lib/R/bin/R --slave --no-restore --no-save --no-restore --file=getSourceForgeData.R'

Loading required package: RCurl
Loading required package: methods
Loading required package: bitops
Loading required package: digest

Retrieving SourceForge data...

Checking request "SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"...
* About to connect() to zerlot.cse.nd.edu port 80 (#0)
*   Trying 129.74.152.47... * connected
> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Content-Length: 37
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 37out of 37 bytes
< HTTP/1.1 200 OK
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< X-Powered-By: PHP/5.2.4-2ubuntu5.25
* Added cookie wiki_db_session="c61...a3c" for domain zerlot.cse.nd.edu, path /, expire 0
< Set-Cookie: wiki_db_session=c61...a3c; path=/
< Content-language: en
< Vary: Accept-Encoding,Cookie
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host zerlot.cse.nd.edu left intact
[1] "Before second postForm()"
* Re-using existing connection! (#0) with host zerlot.cse.nd.edu
* Connected to zerlot.cse.nd.edu (129.74.152.47) port 80 (#0)
> POST /cgi-bin/form.pl HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Cookie: wiki_db_session=c61...a3c
Content-Length: 129
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 129out of 129 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Closing connection #0
Error: Internal Server Error
Execution halted
make: *** [importSourceForge] Error 1

I've tried to figure this out using debug output as well as Network protocol analyzer from Firefox embedded Developer Tools, but so far without much success. Would appreciate any advice and help.

UPDATE:

if (!require(RCurl)) install.packages('RCurl')
if (!require(digest)) install.packages('digest')

library(RCurl)
library(digest)

# Users must authenticate to access Query Form
SRDA_HOST_URL  <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL that Query Form sends POST requests to
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL  <- "1" # add SQL to file

curl <<- getCurlHandle()

srdaLogin <- function (loginURL, username, password) {

  curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
             ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
             followlocation = TRUE, verbose = TRUE)

  params <- list('wpName1' = username, 'wpPassword1' = password)

  if(url.exists(loginURL)) {
    reply <- postForm(loginURL, .params = params, curl = curl,
                      style = "POST")
    #if (DEBUG) print(reply)
    info <- getCurlInfo(curl)
    return (ifelse(info$response.code == 200, TRUE, FALSE))
  }
  else {
    error("Can't access login URL!")
  }
}


srdaConvertRequest <- function (request) {

  return (list(select = "*",
               from = "sf1104.users a, sf1104.artifact b",
               where = "b.artifact_id = 304727"))
}


srdaRequestData <- function (requestURL, select, from, where, sep, sql) {

  params <- list('uitems' = select,
                 'utables' = from,
                 'uwhere' = where,
                 'useparator' = sep,
                 'append_query' = sql)

  if(url.exists(requestURL)) {
    reply <- postForm(requestURL, .params = params, #.opts = opts,
                      curl = curl, style = "POST")
  }
}


srdaGetData <- function(request) {

  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")

  results.query <- readLines(resultsURL, n = 1)

  return (ifelse(results.query == request, TRUE, FALSE))
}


getSourceForgeData <- function (request) {

  # Construct SRDA login and query URLs
  loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                    collapse="", sep="")
  queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")

  # Log into the system 
  if (!srdaLogin(loginURL, USER, PASS))
    error("Login failed!")

  rq <- srdaConvertRequest(request)

  srdaRequestData(queryURL,
                  rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)

  if (!srdaGetData(request))
    error("Data collection failed!")
}


message("\nTesting SourceForge data collection...\n")

getSourceForgeData("SELECT * 
FROM sf1104.users a, sf1104.artifact b 
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727")

# clean up
close(curl)

UPDATE 2 (no functions version):

if (!require(RCurl)) install.packages('RCurl')
library(RCurl)

# Users must authenticate to access Query Form
SRDA_HOST_URL  <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL that Query Form sends POST requests to
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL  <- "1" # add SQL to file


message("\nTesting SourceForge data collection...\n")

curl <- getCurlHandle()

curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
           ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
           followlocation = TRUE, verbose = TRUE)

# === Authentication ===

loginParams <- list('wpName1' = USER, 'wpPassword1' = PASS)

loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                  collapse="", sep="")

if (url.exists(loginURL)) {
  postForm(loginURL, .params = loginParams, curl = curl, style = "POST")
  info <- getCurlInfo(curl)
  message("\nLogin results - HTTP status code: ", info$response.code, "\n\n")
} else {
  error("\nCan't access login URL!\n\n")
}

# === Data collection ===

# Previous query was: "SELECT * FROM sf0305.users WHERE user_id < 100"
query <- list(select = "*",
              from = "sf1104.users a, sf1104.artifact b",
              where = "b.artifact_id = 304727") 

getDataParams <- list('uitems'       = query$select,
                      'utables'      = query$from,
                      'uwhere'       = query$where,
                      'useparator'   = DATA_SEP,
                      'append_query' = ADD_SQL)

queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")

if(url.exists(queryURL)) {
  postForm(queryURL, .params = getDataParams, curl = curl, style = "POST")
  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")
  results.query <- readLines(resultsURL, n = 1)
  request <- paste(query$select, query$from, query$where)
  if (results.query == request)
    message("\nData request is successful, SQL query: ", request, "\n\n")
  else
    message("\nData request failed, SQL query: ", request, "\n\n")
} else {
  error("\nCan't access data query URL!\n\n")
}

close(curl)

UPDATE 3 (server-side debugging)

Finally, I was able to get in touch with a person responsible for the system and he helped me to narrow down the issue to cookie management IMHO. Here's the error log record, corresponding to running my code:

[Fri Mar 21 15:33:14 2014] [error] [client 54.204.180.203] [Fri Mar 21 15:33:14 2014] form.pl: /tmp/sess_3e55593e436a013597cd320e4c6a2fac: at /var/www/cgi-bin/form.pl line 43

The following is the snippet of the server-side script (Perl) that generated that error (line #1 in the script is bash interpreter directive, so reported line number 43 is most likely line number 44):

42     if (-e "/tmp/sess_$file") {
43     $session = PHP::Session->new($cgi->cookie("$session_name"));
44     $user_id = $session->get('wsUserID');
45     $user_name = $session->get('wsUserName');

The following is a session information (1) after authentication and (2) after submitting data request, obtained by tracing manual authentication and manual data request form submission:

(1) "wiki_dbUserID=449; expires=Sun, 20-Apr-2014 21:04:14 GMT; path=/wiki_dbUserName=Blekh; expires=Sun, 20-Apr-2014 21:04:14 GMT; path=/wiki_dbToken=deleted; expires=Thu, 21-Mar-2013 21:04:13 GMT"

(2) wiki_db_session=aaed058f97059174a59effe44b137cbc; _ga=GA1.2.2065853334.1395410153; EDSSID=e24ff5ed891c28c61f2d1f8dec424274; wiki_dbUserName=Blekh; wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449

Would appreciate any help in figuring out the problem with my code!

Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64
  • You'll need to show your R code - how are you ensuring that the cookies are preserved between requests? (httr does that by default, but RCurl does not) – hadley Mar 12 '14 at 14:24
  • @hadley: I provided link to my source code at GitHub (just before the output). – Aleksandr Blekh Mar 12 '14 at 21:00
  • That's a lot of code. I'd recommend weeding it down to a simple reproducible test case. – hadley Mar 13 '14 at 01:41
  • @hadley: I'm updating my question with test case per your recommendation. It's not as small as I wanted it to be, but I did my best to minimize and simplify the code without too much deviation from my real life scenario. I already tested it and the results are the same. One interesting thing is that I discovered the lack of a space in my SQL request (fixed that), which should've triggered an appropriate SQL syntax message. To me it means that the query even doesn't reach to the point of being parsed. Sending you USER and PASS as I don't want to compromise access to the system. – Aleksandr Blekh Mar 13 '14 at 03:01
  • @Please ignore my note on SQL query syntax - it was fine. – Aleksandr Blekh Mar 13 '14 at 04:27
  • I think you can make it simpler by removing all the functions and just sending the requests in order – hadley Mar 13 '14 at 13:36
  • @hadley: Done. Please see Update 2. – Aleksandr Blekh Mar 13 '14 at 21:08
  • @AleksandrBlekh according to the script, if you provide wrong login credentials it shows the message error("Login failed!") ?? – Hackerman Apr 16 '14 at 20:05
  • @RobertRozas: Actually, I provide correct login credentials and it returns 200 (OK) on authentication. I don't want to publish them in the open, but, if you'd like to help, please let me know and I will e-mail them directly to you. The server-side Perl script (not my R code) fails (with HTTP status 500) on attempt to process submitted form with SQL query (see Update 3 above). – Aleksandr Blekh Apr 16 '14 at 22:02
  • In your perl code, just to discard some other things...can you comment the line 43 and provide hardcoded user_id and user_name just to see if it works in that way?? – Hackerman Apr 17 '14 at 00:28
  • @RobertRozas: Unfortunately, I have no control over the server-side (Perl) code. That server-side code snippet has been provided to me by the SRDA administrator, when he was trying to help me during a joint debugging session. As I said, I can e-mail you user ID and name to use in R code, but the problem seems to be in a cookie management mismatch between client side and server side. – Aleksandr Blekh Apr 17 '14 at 00:56
  • Ok, i'm gonna wait for the email on linkedin; i want to try with a php version of your script too...cheers – Hackerman Apr 17 '14 at 03:12
  • @RobertRozas: Just sent you message on LinkedIn with credentials. – Aleksandr Blekh Apr 17 '14 at 05:09

3 Answers3

1

Finally, finally, finally! I have figured out what was causing this problem, which gave me so much headache (figuratively and literally). It forced me to spend a lot of time reading various Internet resources (including many SO questions and answers), debugging my code and communicating with people. I spent a lot of time, but not in vain, as I learned a lot about RCurl, cookies, Web forms and HTTP protocol.

The reason appeared much simpler than I thought. While the direct reason of the form submission failure was related to cookie management, the underlying reason was using wrong parameter names (IDs) of the authentication form fields. The two pairs were very similar and it took only one extra character to trigger the whole problem.

Lesson learned: when facing issues, especially ones dealing with authentication, it's very important to check all names and IDs multiple times and very carefully to make sure they correspond the ones supposed to be used. Thank you to everyone who was helping or trying to help me with this issue!

Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64
0

I've simplified the code still further:

library(httr)

base_url  <- "http://srda.cse.nd.edu"

loginURL <- modify_url(
  base_url, 
  path = "mediawiki/index.php", 
  query = list(
    title = "Special:Userlogin", 
    action = "submitlogin",
    type = "login",
    wpName1 = USER,
    wpPasswor1 = PASS
  )
)
r <- POST(loginURL)
stop_for_status(r)

queryURL <- modify_url(base_url, path = "cgi-bin/form.pl")
query <- list(
  uitems       = "user_name",
  utables      = "sf1104.users a, sf1104.artifact b",
  uwhere       = "a.user_id = b.submitted_by AND b.artifact_id = 304727",
  useparator   = ":",
  append_query = "1"
)
r <- POST(queryURL, body = query, multipart = FALSE)
stop_for_status(r)

But I'm still getting a 500. I tried:

  • setting extra cookies that I see in the browser (wiki_dbUserID, wiki_dbUserName)
  • setting header DNT to 1
  • setting referer to http://srda.cse.nd.edu/cgi-bin/form.pl
  • setting user-agent the same as chrome
  • setting accept "text/html"
hadley
  • 102,019
  • 32
  • 183
  • 245
  • Appreciate your help! I expected that the result will be the same regardless of the version. But, I understand your intention to isolate the problem. So, what's the next step, in your opinion, considering that I don't have direct access to the system logs? – Aleksandr Blekh Mar 14 '14 at 01:00
  • 1
    @AleksandrBlekh as the error message says, email the system administrator – hadley Mar 14 '14 at 11:04
  • Hi, Hadley! Finally, I got in touch with person responsible for the system and obtained needed details (please see UPDATE 3). Would appreciate if you could take a look and advise on that! – Aleksandr Blekh Mar 21 '14 at 20:41
  • 1
    @AleksandrBlekh you might try using `verbose()` from the dev version of httr - it provides more information about what's getting sent in the body of the POST. That might be the problem. – hadley Apr 23 '14 at 14:03
  • Hi, Hadley! Appreciate your advice. I have already figured out this problem. Please see my final answer - you might be surprised what the real reason was. It would be great, if you could take a look at this: http://stackoverflow.com/questions/22372758/subsetting-lists-via-index-vectors. Should be piece of cake for you ;-). – Aleksandr Blekh Apr 24 '14 at 02:56
0

The following provides clarification for the scenario (error situation).

From W3C RFC 2616 - HTTP/1.1 Specification:

10.5 Server Error 5xx

Response status codes beginning with the digit "5" indicate cases in which the server is aware that it has erred or is incapable of performing the request. Except when responding to a HEAD request, the server SHOULD include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. User agents SHOULD display any included entity to the user. These response codes are applicable to any request method.

10.5.1 500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.

My interpretation of the paragraph 10.5 is that it implies that there should be a more detailed explanation of the error situation beyond the one provided in paragraph 10.5.1. However, I recognize that it very well may be that the message for status code 500 (paragraph 10.5.1) is considered sufficient. Confirmations for either of interpretations are welcome!

Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64