Scraping Javascript-Rendered Content in R from a Webpage without Unique URL

Question

I want to scrape historical results of South African LOTTO draws (especially Total Pool Size, Total Sales, etc.) from the South African National Lottery website. By default one sees links to results for the last ten draws, or one can select a date range to pull up a larger set of links to draws (which will still display only ten per page).

Hovering in the browser over a link, e.g. 'LOTTO DRAW 2012', we see javascript:void(); so it is clear that the draw results are rendered using JavaScript. Following advice in an R Web Scraping Cheat Sheet, I opened the Google Chrome Developer Tools, opened the Network tab, and then clicked the link to the draw 'LOTTO DRAW 2012'. When I did so, I could see that this URL is being called with an initiator.

When I right-click on the initiator and select 'Copy Response', I can see the data I need inside a 'drawDetails' object in what appears to be JSON code.

{"code":200,"message":"OK","data":{"drawDetails":{"drawNumber":"2012","drawDate":"2020\/04\/11","nextDrawDate":"2020\/04\/15","ball1":"48","ball2":"6","ball3":"43","ball4":"41","ball5":"25","ball6":"45","bonusBall":"38","div1Winners":"1","div1Payout":"10546013.8","div2Winners":"0","div2Payout":"0","div3Winners":"28","div3Payout":"7676.4","div4Winners":"62","div4Payout":"2751.4","div5Winners":"1389","div5Payout":"206.3","div6Winners":"1872","div6Payout":"133","div7Winners":"28003","div7Payout":"50","div8Winners":"20651","div8Payout":"20","rolloverAmount":"0","rolloverNumber":"0","totalPrizePool":"13280236.5","totalSales":"11610950","estimatedJackpot":"2000000","guaranteedJackpot":"0","drawMachine":"RNG2","ballSet":"RNG","status":"published","winners":52006,"millionairs":1,"gpwinners":"52006","wcwinners":"0","ncwinners":"0","ecwinners":"0","mpwinners":"0","lpwinners":"0","fswinners":"0","kznwinners":"0","nwwinners":"0"},"totalWinnerRecord":{"lottoMillionairs":28716702,"lottoWinners":337285646,"ithubaMillionairs":135763,"ithubaWinners":305615802}},"videoData":[{"id":"1049","listid":"1","parentid":"1","videosource":"youtube","videoid":"chHfFxVi9QI","imageurl":"","title":"LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 (11 APRIL 2020)","description":"","custom_imageurl":"","custom_title":"","custom_description":"","specialparams":"","lastupdate":"0000-00-00 00:00:00","allowupdates":"1","status":"0","isvideo":"1","link":"https:\/\/www.youtube.com\/watch?v=chHfFxVi9QI","ordering":"10001","publisheddate":"2020-04-11 20:06:17","duration":"182","rating_average":"0","rating_max":"0","rating_min":"0","rating_numRaters":"0","statistics_favoriteCount":"0","statistics_viewCount":"329","keywords":"","startsecond":"0","endsecond":"0","likes":"6","dislikes":"0","commentcount":"0","channel_username":"","channel_title":"","channel_subscribers":"9880","channel_subscribed":"0","channel_location":"","channel_commentcount":"0","channel_viewcount":"0","channel_videocount":"1061","channel_description":"","channel_totaluploadviews":"0","alias":"lotto-lotto-plus-1-and-lotto-plus-2-draw-2012-11-april-2020","rawdata":"","datalink":"https:\/\/www.googleapis.com\/youtube\/v3\/videos?id=chHfFxVi9QI&part=id,snippet,contentDetails,statistics&key=AIzaSyC1Xvk2GUdb_N3UiFtjsgZ-uMviJ_8MFZI"}]}

It is a POST request, so I tried to follow this answer, but I could not find any onclick values indicating the data submitted with the form. Moreover, the request URL for 'LOTTO DRAW 2012' is identical to that for 'LOTTO DRAW 2011', so no unique identifier for the particular draw is passed in the URL itself. Thus it is not clear to me how the request for the results of a particular draw is made.

Hence, the smaller question is, given a particular LOTTO draw number or draw date, how does one find out the unique identifier that is used to make the POST request for the data pertaining to that draw specifically?

The larger question is, if one is able to obtain such unique identifiers for all the historical draws, how can one generate the JSON drawDetails object for all the historical draws in turn, or otherwise complete the scraping operation?

Click on the particular request you're interested in, in that side panel. Then click `Headers` and scroll down. See if there's a `Query Form` or something like that. — chitown88, Apr 13 '20 at 13:34
There is `Form Data` with values `gameName` and `drawNumber`; these together would uniquely identify the draw. Thanks - so that answers the first question. The further question is how to run that request for a given `drawNumber` value from within R, in order to generate the JSON drawDetails object. — Thomas Farrar, Apr 13 '20 at 20:32

score 2 · Accepted Answer · answered Apr 13 '20 at 20:57

You are right - the contents of the page are updated by JavaScript via an Ajax request. The server returns a JSON string in response to an HTTP POST request. With POST requests, the server's response is determined not only by the URL you request, but also by the body of the message you send to the server. In this case, the body is a simple form with three fields: gameName, which is always LOTTO; isAjax, which is always true; and drawNumber, which is the field you want to vary.

If you are using httr, you specify these fields as a named list in the body parameter of the POST function.
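As a minimal sketch of a single request, assuming httr is installed: the names `lotto_url`, `lotto_body` and `fetch_draw` are made up for illustration, while the URL and field names are the ones seen in the browser's Form Data panel. Passing `encode = "form"` sends the fields URL-encoded; httr's default for a named-list body is multipart, and it is an assumption here that the server accepts either.

```r
# URL taken from the Network tab; note the plain "&" separators.
lotto_url <- paste0("https://www.nationallottery.co.za/index.php",
                    "?task=results.redirectPageURL&Itemid=265",
                    "&option=com_weaver&controller=lotto-history")

# Build the form fields for one draw (hypothetical helper).
lotto_body <- function(draw_number) {
  list(gameName = "LOTTO",
       drawNumber = as.character(draw_number),
       isAjax = "true")
}

# Send the POST request and return the raw JSON text (hypothetical helper).
fetch_draw <- function(draw_number) {
  res <- httr::POST(lotto_url, body = lotto_body(draw_number), encode = "form")
  httr::stop_for_status(res)  # raise an R error on HTTP 4xx/5xx
  httr::content(res, as = "text", encoding = "UTF-8")
}
```

Calling `fetch_draw(2012)` should then return the JSON string for draw 2012, provided the site still behaves as it did when the question was asked.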

Once you have the response for each draw, you will want to parse the JSON into an R-friendly format such as a list or data frame, using a library such as jsonlite. From the structure of this particular JSON, it makes most sense to extract the component $data$drawDetails and make that a one-row data frame. This will allow you to bind several draws together into a single data frame.
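To illustrate the parsing step offline, here is a small sketch that runs on a trimmed copy of the response shown in the question (only six of the drawDetails fields are kept, and the names `json_txt`, `parsed` and `draw_row` are illustrative). It assumes jsonlite is installed.

```r
# Trimmed sample of the server response shown in the question.
json_txt <- '{"code":200,"message":"OK","data":{"drawDetails":{
  "drawNumber":"2012","drawDate":"2020/04/11","ball1":"48",
  "totalPrizePool":"13280236.5","totalSales":"11610950","winners":52006}}}'

parsed <- jsonlite::fromJSON(json_txt)

# drawDetails is a named list of scalars, so it converts to one row.
draw_row <- as.data.frame(parsed$data$drawDetails, stringsAsFactors = FALSE)

dim(draw_row)         # one row, six columns
draw_row$totalSales   # still a character string at this point
```

Because each draw becomes one row with the same column names, the rows can later be stacked with `rbind()`.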

Here is a function that does all that for you:

lotto_details <- function(draw_numbers)
{
 # One POST per draw number; the one-row results are stacked with rbind
 do.call("rbind", lapply(draw_numbers, function(x)
 {
   # The URL is the same for every draw; only the form body changes
   res <- httr::POST(paste0("https://www.nationallottery.co.za/index.php",
                            "?task=results.redirectPageURL&",
                            "Itemid=265&option=com_weaver&",
                            "controller=lotto-history"),
                     body = list(gameName = "LOTTO", drawNumber = x, isAjax = "true"))
   # Parse the JSON text and keep drawDetails as a one-row data frame
   as.data.frame(jsonlite::fromJSON(httr::content(res, "text"))$data$drawDetails)
 }))
}

Which you use like this:

lotto_details(2009:2012)
#>   drawNumber   drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1       2009 2020/04/01   2020/04/04    51    15     7    32    42    45
#> 2       2010 2020/04/04   2020/04/08    43     4    21    24    10     3
#> 3       2011 2020/04/08   2020/04/11    42    43     8    18     2    29
#> 4       2012 2020/04/11   2020/04/15    48     6    43    41    25    45
#>   bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1         1           0          0           0          0          21
#> 2        22           0          0           0          0          31
#> 3        34           0          0           0          0          21
#> 4        38           1 10546013.8           0          0          28
#>   div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1     8455.3          60     2348.7        1252        189        1786
#> 2     6004.3          71     2080.6        1808      137.3        2352
#> 3     8584.5          60     2384.6        1405      171.1        2079
#> 4     7676.4          62     2751.4        1389      206.3        1872
#>   div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1      115.2       24664         50       19711         20     3809758.17
#> 2       91.7       35790         50       25981         20     5966533.86
#> 3      100.5       27674         50       21895         20     8055430.87
#> 4        133       28003         50       20651         20              0
#>   rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1              2     6198036.67    9879655          6000000
#> 2              3     9073426.56   11696905          8000000
#> 3              4    10649716.37   10406895         10000000
#> 4              0     13280236.5   11610950          2000000
#>   guaranteedJackpot drawMachine ballSet    status winners millionairs
#> 1                 0        RNG2     RNG published   47494           0
#> 2                 0        RNG2     RNG published   66033           0
#> 3                 0        RNG2     RNG published   53134           0
#> 4                 0        RNG2     RNG published   52006           1
#>   gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1     47494         0         0         0         0         0         0
#> 2     66033         0         0         0         0         0         0
#> 3     53134         0         0         0         0         0         0
#> 4     52006         0         0         0         0         0         0
#>   kznwinners nwwinners
#> 1          0         0
#> 2          0         0
#> 3          0         0
#> 4          0         0

Created on 2020-04-13 by the reprex package (v0.3.0)
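Most fields in the JSON are quoted, so the corresponding columns arrive as character strings rather than numbers. One possible way to convert them is base R's `type.convert()`. The sketch below uses a small hand-made data frame (`draws`, an illustrative name) holding a few values copied from the output above, so it runs without contacting the site.

```r
# Hand-made stand-in for two rows of lotto_details() output (all strings).
draws <- data.frame(
  drawNumber = c("2011", "2012"),
  drawDate   = c("2020/04/08", "2020/04/11"),
  totalSales = c("10406895", "11610950"),
  div3Payout = c("8584.5", "7676.4"),
  stringsAsFactors = FALSE
)

# Convert number-like columns; as.is = TRUE keeps other text as character.
draws <- type.convert(draws, as.is = TRUE)

# Dates need explicit parsing because of the "/" separators.
draws$drawDate <- as.Date(draws$drawDate, format = "%Y/%m/%d")

str(draws)
```

After conversion, columns such as `totalSales` can be summed or plotted directly.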

Brilliant, thanks! I had simultaneously arrived at a nearly identical solution, though yours is more elegant — Thomas Farrar, Apr 13 '20 at 22:19

score 0 · Answer 2 · answered Apr 13 '20 at 22:56

The question already has a satisfactory answer (see above) that I've accepted. I simultaneously arrived at a nearly identical solution; I add it here only because it explicitly covers the full range of available draw numbers and will automatically detect the most recent draw number so that the code can be run 'as is' in the future, provided the National Lottery website design remains the same.

theurl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"
x <- rvest::html_text(xml2::read_html(theurl))
preceding_string <- "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW "
drawnums <- as.integer(vapply(gregexpr(preceding_string, x)[[1]] + nchar(preceding_string), 
              function(k) substr(x, start = k, stop = k + 3), NA_character_))
drawnumrange <- 1506:max(drawnums)
response <- lapply(drawnumrange, function(d) httr::POST(url = theurl, 
                body = list(gameName = "LOTTO", drawNumber = as.character(d), isAjax = 
                "true"), encode = "form"))
jsondat <- lapply(response, function(r) jsonlite::parse_json(httr::content(r, as = "text", encoding = "UTF-8"))$data$drawDetails)
lottotable <- as.data.frame(do.call(rbind, jsondat))
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- sapply(lottotable[numericcols], as.numeric)
xlsx::write.xlsx2(lottotable[1:37], "lottotable.xlsx", row.names = FALSE)
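The `substr()` call above assumes that draw numbers always have four digits. A sketch of a regex-based alternative that does not depend on the digit count is shown below. It runs on a made-up stand-in for the page text (`page_text` is illustrative, not real site output), and the names `pattern`, `hits` and `latest_draw` are likewise hypothetical.

```r
# Made-up stand-in for the text returned by rvest::html_text().
page_text <- paste(
  "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 (11 APRIL 2020)",
  "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2011 (08 APRIL 2020)",
  sep = "\n"
)

# Match the fixed prefix followed by one or more digits.
pattern <- "LOTTO PLUS 2 DRAW [0-9]+"
hits <- regmatches(page_text, gregexpr(pattern, page_text))[[1]]

# Keep only the trailing digits of each match.
drawnums <- as.integer(sub(".* ", "", hits))
latest_draw <- max(drawnums)
```

`latest_draw` could then replace `max(drawnums)` as the upper end of the draw number range.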
