I need to collect studies from a website that can only display 1000 studies at a time in XML. In my case there are more than 1000 studies, so I need to iterate until I have everything. I am trying to do this with a recursive function. I know it is not the only way, but I am so close that I would like to know what is going wrong here.
import requests
from xml.etree import ElementTree as ET

def collect_ct_ids(nct_ids=[], start=1):
    url = f'https://www.clinicaltrials.gov/ct2/results?&type=Intr&intr=%22NICOTINE+BITARTRATE%22%20OR%20%22PROSTEP%22%20OR%20%22NICOTROL%22%20OR%20%22NICOTINE+POLACRILEX%22%20OR%20%22NICORETTE%22%20OR%20%22NICODERM+CQ%22%20OR%20%22HABITROL%22%20OR%20%22NICOTINE%22&down_fmt=csv&down_flds=all?displayxml=true&start={start}&count=10000'
    print(url)
    # read xml
    response = requests.get(url)
    root = ET.fromstring(response.content)
    for child in root.iter('search_results'):
        n_studies = int(child.attrib['count'])
    # store studies
    nct_ids = nct_ids + [x.find('nct_id').text for x in root.findall('clinical_study')]
    print(n_studies, len(nct_ids))
    ## check that the default url allows to collect all the studies
    if n_studies == len(nct_ids):
        # return the studies
        print('if')
        print(nct_ids)
        return(nct_ids)
    else:
        # reiterate until every study has been collected
        print('else')
        start += 1000
        return(nct_ids.append(collect_ct_ids(nct_ids, start)))
Basically, when the function should return a list of 1342 studies, it returns None, even though the nct_ids object contains the 1342 studies.
Any help welcome.
Edit2:
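The problem was that list.append() modifies the list in place and returns None, so the second return handed None back to the caller. The solution is to concatenate instead (the + operator, i.e. .__add__, returns a new list) when storing the studies. There is also no need to add to the list again in the second return, because the recursive call already returns everything collected so far and adding it again duplicates elements.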
import requests
from xml.etree import ElementTree as ET

def collect_ct_ids(nct_ids=[], start=1):
    url = f'https://www.clinicaltrials.gov/ct2/results?&type=Intr&intr=%22NICOTINE+BITARTRATE%22%20OR%20%22PROSTEP%22%20OR%20%22NICOTROL%22%20OR%20%22NICOTINE+POLACRILEX%22%20OR%20%22NICORETTE%22%20OR%20%22NICODERM+CQ%22%20OR%20%22HABITROL%22%20OR%20%22NICOTINE%22&down_fmt=csv&down_flds=all?displayxml=true&start={start}&count=1000'
    print(url)
    # read xml
    response = requests.get(url)
    root = ET.fromstring(response.content)
    for child in root.iter('search_results'):
        n_studies = int(child.attrib['count'])
    # store studies; + is the operator form of .__add__ and returns a new list
    nct_ids = nct_ids + [x.find('nct_id').text for x in root.findall('clinical_study')]
    print(n_studies, len(nct_ids))
    ## check that the default url allows to collect all the studies
    if n_studies == len(nct_ids):
        # return the studies
        print('if')
        #print(nct_ids)
        return nct_ids
    else:
        # reiterate until every study has been collected
        print('else')
        start += 1000
        # the recursive call already returns the full list, so return it as-is
        # (adding it to nct_ids again would duplicate elements)
        return collect_ct_ids(nct_ids, start)
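For anyone who wants to check the pattern without hitting the website, here is a minimal sketch of the same recursion. It assumes a made-up `fetch_page` helper that stands in for the HTTP request, fake ids, a page size of 10 and a total of 25; none of these names or numbers come from clinicaltrials.gov, and `collect_ids` is only an illustrative name.

```python
def fetch_page(start, count, total=25):
    """Hypothetical stand-in for the request: returns (total, ids on this page)."""
    last = min(start + count - 1, total)
    return total, [f"NCT{i:08d}" for i in range(start, last + 1)]


def collect_ids(ids=None, start=1, count=10):
    ids = [] if ids is None else ids
    total, page = fetch_page(start, count)
    # + builds and returns a new list, whereas ids.append(...) returns None
    ids = ids + page
    # stop when everything is collected (or when a page comes back empty)
    if len(ids) >= total or not page:
        return ids
    # return the recursive result as-is: it already holds every id so far
    return collect_ids(ids, start + count, count)


all_ids = collect_ids()
print(len(all_ids), len(set(all_ids)))  # 25 25, so no duplicates
print([].append("x"))                   # None
```

The last line shows why the first version returned None: `append` changes the list in place and gives nothing back, so returning its result returns None.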