web scraping: expanding list to get children

Question

I am trying to get, from this page: https://bioportal.bioontology.org/ontologies/MEDDRA?p=classes&conceptid=10040786, the list of MedDRA codes (codes for adverse events) hidden here:

It is in this element:

When I click I get this list:

Which I could scrape with rvest, to get the medDRA codes encapsulated in the links:

The problem is how to automatically display the list.

When looking at the XHR, I get this request, which opens the list:

https://bioportal.bioontology.org/ajax_concepts/MEDDRA/?conceptid=http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FMEDDRA%2F10040786&callback=children&_=1667913401293

But I do not understand the rationale for the last number, so I cannot automate the request. Is there another way? How could I proceed to get this data?

Can you access the list elements using xpath without expanding the list? If not, you may want to try RSelenium to expand the list. — zephryl, Nov 08 '22 at 14:05
Alternatively, have you looked at [their API](http://data.bioontology.org/documentation)? — zephryl, Nov 08 '22 at 14:09
@zephryl No I cannot, and I would like to avoid RSelenium if possible — denis, Nov 08 '22 at 15:31
@zephryl No, good point, I will have a look. But I am still interested in an answer — denis, Nov 08 '22 at 15:32
The last bit is just a [Unix](https://www.unixtimestamp.com/) timestamp. It helps, amongst other things, to avoid being served cached results. It can be excluded, or recreated if you plan to make large numbers of requests within a relatively short time frame — QHarr, Nov 08 '22 at 22:34
So, make an HTTP request to the AJAX endpoint you have identified, either removing the Unix timestamp or generating one and adding it to the end. Parse the response with rvest and extract the elements of interest. — QHarr, Nov 08 '22 at 22:42
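
A rough sketch of what the comments above suggest might look like the following. The function names (`build_children_url`, `extract_codes`, `fetch_children_codes`) are made up for illustration. The sketch assumes that the endpoint returns an HTML fragment and that each link's `href` ends in `%2F` followed by the code; neither assumption has been checked against the live site.

```r
library(rvest)

# Rebuild the XHR URL from the question for a given concept id.
# The trailing "_" parameter is the current Unix time in milliseconds.
build_children_url <- function(concept_id, ontology = "MEDDRA") {
  concept_iri <- paste0("http://purl.bioontology.org/ontology/", ontology, "/", concept_id)
  timestamp_ms <- sprintf("%.0f", as.numeric(Sys.time()) * 1000)
  paste0(
    "https://bioportal.bioontology.org/ajax_concepts/", ontology, "/",
    "?conceptid=", utils::URLencode(concept_iri, reserved = TRUE),
    "&callback=children&_=", timestamp_ms
  )
}

# Pull the codes out of an HTML fragment.
# Assumption: the links of interest end in "%2F<digits>".
extract_codes <- function(html_text) {
  hrefs <- html_attr(html_elements(read_html(html_text), "a"), "href")
  hrefs <- hrefs[grepl("%2F[0-9]+$", hrefs)]
  unique(sub("^.*%2F", "", hrefs))
}

# Fetch the children of one concept and return their codes.
# Assumption: the response body is plain HTML that read_html() can parse.
fetch_children_codes <- function(concept_id) {
  page <- read_html(build_children_url(concept_id))
  extract_codes(as.character(page))
}

# Example call (needs network access):
# codes <- fetch_children_codes("10040786")
```

If the response turns out to be wrapped in something other than plain HTML, only `fetch_children_codes` should need adjusting.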

score 1 · Answer 1 · answered Apr 16 '23 at 23:06

I have been able to extract the numbers with the following code:

library(RSelenium)

# Start a Selenium server in Docker (host port 4446 -> container port 4444).
# system() is used here because shell() exists only on Windows.
system('docker run -d -p 4446:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4446L, browserName = "firefox")
remDr$open()
remDr$navigate("https://bioportal.bioontology.org/ontologies/MEDDRA?p=classes&conceptid=10040786")
remDr$screenshot(TRUE)

# Give the page time to load, then click the "+" icon to expand the list.
Sys.sleep(3)
web_Obj_Plus_Sign <- remDr$findElement('xpath', '/html/body/div[1]/div[2]/div[5]/div/div/div[2]/div/div[2]/div[2]/div[1]/div[2]/div/ul/li/ul/li[2]/img')
web_Obj_Plus_Sign$clickElement()

list_Url <- list()

# Links are looked up at li[2], li[4], ...; stop at the first index with no link.
for (i in seq_len(100)) {
  print(i)
  xpath <- paste0('/html/body/div[1]/div[2]/div[5]/div/div/div[2]/div/div[2]/div[2]/div[1]/div[2]/div/ul/li/ul/li[2]/ul/li[', i * 2, ']/a')
  # Return NULL when the element is missing, so the check below is unambiguous.
  web_Obj_Link <- tryCatch(remDr$findElement("xpath", xpath), error = function(e) NULL)

  if (is.null(web_Obj_Link)) {
    break
  }
  list_Url[[i]] <- web_Obj_Link$getElementAttribute("href")[[1]]
}

# Keep the text after the last "%" in each href.
MEDDRA_Number <- unlist(lapply(X = list_Url, FUN = function(x) tail(strsplit(x, "%")[[1]], 1)))
MEDDRA_Number

 [1] "2F10000318" "2F10000513" "2F10075963" "2F10059136" "2F10049044" "2F10005192" "2F10051548" "2F10007247"
 [9] "2F10074010" "2F10012470" "2F10065259" "2F10060803" "2F10014141" "2F10014199" "2F10015146" "2F10057211"
[17] "2F10021531" "2F10063866" "2F10071367" "2F10050500" "2F10021784" "2F10065487" "2F10054994" "2F10073621"
[25] "2F10076139" "2F10061303" "2F10061304" "2F10051296" "2F10054019" "2F10087209" "2F10069447" "2F10037578"
[33] "2F10037632" "2F10085875" "2F10037888" "2F10069443" "2F10040855" "2F10040872" "2F10042343" "2F10085173"
[41] "2F10055027" "2F10067653" "2F10066047
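
The values printed above still carry the `2F` left over from the URL-encoded `%2F`, because the split is done on `%`. A minimal sketch of stripping it, using the first three values from the output (the variable names are only illustrative):

```r
# First three values copied from the output above.
meddra_raw <- c("2F10000318", "2F10000513", "2F10075963")

# Drop the leading "2F" so that only the numeric code remains.
meddra_codes <- sub("^2F", "", meddra_raw)
print(meddra_codes)
```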
