I'm trying to loop over a list of 11,500 PubChem CIDs to retrieve the BioAssay results table (when available). For example, for the CID 2965821, this is the table I want to obtain. I only need the rows where activity is "Active".
Following this script, I can only get to the numbers of the active aids, but I can't get the full table with the names of the targets, etc.
Here is the code I have with only one compound:
df <- data.frame("Compound" = 2965821)
df <- as.data.table(df)
#set up progressbar
pb <- txtProgressBar(min = 0, max = nrow(df), style = 3)
#loop through df rows
for(i in 1:nrow(df)){
#update progressbar
setTxtProgressBar(pb, i)
#extract active aids data
data <- fromJSON(readLines(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/", df[i,]$Compound, "/aids/JSON/?aids_type=active")))
#extract active aid numbers
compound_active_aid_numbers <- data$InformationList$Information$AID
#add to df
df[i, name := compound_active_aid_numbers]
}
compound_active_aid_numbers
When I asked the PubChem staff for help, they gave me a different URL syntax that uses Structured Data Query (SDQ) agent and that directs me to the table I need. I've integrated this URL in the following loop – the problem is that it is extremely slow.
library(tidyverse)
df <- data.frame("Compound" = 2965821)
df <- as.data.table(df)
data_bioassays <- NULL
for(i in 1:nrow(df)){
url <- paste0('https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={"download":"*","collection":"bioactivity","where":{"ands":[{"cid":"',df[i,]$Compound,'"}]},"order":["acvalue,asc"],"start":1,"limit":10000000,"downloadfilename":"CID_',df[i,]$Compound,'_bioactivity","nullatbottom":1}')
data_x <- fread(url) %>% filter(activity=="Active")
data_bioassays <- bind_rows(data_bioassays,data_x)
}
data_bioassays
Would you have any suggestions on how to improve this code?
Thank you!