programmatic access to pubchem bioassay data in R

Question

I'm trying to loop over a list of 11,500 PubChem CIDs to retrieve the BioAssay results table (when available). For example, for the CID 2965821, this is the table I want to obtain. I only need the rows where activity is "Active".

Following this script, I can only get to the numbers of the active aids, but I can't get the full table with the names of the targets, etc.

Here is the code I have with only one compound:

df <- data.frame("Compound" = 2965821)
df <- as.data.table(df)

#set up progressbar
pb <- txtProgressBar(min = 0, max = nrow(df), style = 3)

#loop through df rows
for(i in 1:nrow(df)){
  #update progressbar
  setTxtProgressBar(pb, i)  
  
  #extract active aids data 
  data <- fromJSON(readLines(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/", df[i,]$Compound, "/aids/JSON/?aids_type=active")))

  #extract active aid numbers
  compound_active_aid_numbers <- data$InformationList$Information$AID
  
  #add to df
  df[i, name := compound_active_aid_numbers]
}

compound_active_aid_numbers

When I asked the PubChem staff for help, they gave me a different URL syntax that uses Structured Data Query (SDQ) agent and that directs me to the table I need. I've integrated this URL in the following loop – the problem is that it is extremely slow.

library(tidyverse)

df <- data.frame("Compound" = 2965821)
df <- as.data.table(df)
data_bioassays <- NULL
for(i in 1:nrow(df)){
url <- paste0('https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={"download":"*","collection":"bioactivity","where":{"ands":[{"cid":"',df[i,]$Compound,'"}]},"order":["acvalue,asc"],"start":1,"limit":10000000,"downloadfilename":"CID_',df[i,]$Compound,'_bioactivity","nullatbottom":1}')
  data_x <- fread(url) %>% filter(activity=="Active")
  data_bioassays <- bind_rows(data_bioassays,data_x) 
}

data_bioassays

Would you have any suggestions on how to improve this code?

Thank you!

move the bind_rows outside the loop - initialise a list and save each iteration as an object in the list, then bind rows afterwards. Each time you `bind_rows()` a complete copy of the object is created, so over multiple iterations this slows things down a lot. — Paul Stafford Allen, Jan 13 '23 at 15:26
ok, thank you! I'll modify this. I'm afraid though that what is really slowing this down is this line: data_x <- fread(url) %>% filter(activity=="Active") — xsrt, Jan 13 '23 at 16:41

programmatic access to pubchem bioassay data in R

0 Answers0