
I need to collect studies from a website that can only display 1000 studies at a time in XML. In my case there are more than 1000 studies, so I need to iterate until I have everything. I am trying to do this with a recursive function. I know it is not the only way, but I am so close that I would like to know what is going wrong here.

import requests
from xml.etree import ElementTree as ET

def collect_ct_ids(nct_ids=[], start=1):    
    url = f'https://www.clinicaltrials.gov/ct2/results?&type=Intr&intr=%22NICOTINE+BITARTRATE%22%20OR%20%22PROSTEP%22%20OR%20%22NICOTROL%22%20OR%20%22NICOTINE+POLACRILEX%22%20OR%20%22NICORETTE%22%20OR%20%22NICODERM+CQ%22%20OR%20%22HABITROL%22%20OR%20%22NICOTINE%22&down_fmt=csv&down_flds=all?displayxml=true&start={start}&count=10000'
    print(url)

    # read xml
    response = requests.get(url)
    root = ET.fromstring(response.content)

    for child in root.iter('search_results'):
        n_studies = int(child.attrib['count'])

    # store studies
    nct_ids = nct_ids + [x.find('nct_id').text for x in root.findall('clinical_study')]

    print(n_studies, len(nct_ids))

    ## check that the default url allows to collect all the studies
    if n_studies == len(nct_ids):
        # return the studies
        print('if')
        print(nct_ids)
        return(nct_ids)   

    else:
        # reiterate until every study has been collected
        print('else')
        start += 1000
        return(nct_ids.append(collect_ct_ids(nct_ids, start)))

Basically, the function should return a list with 1342 studies, but it returns None, even though the nct_ids object contains the 1342 studies.

Any help welcome.

Edit2:

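The solution is to concatenate with `+` (which calls `list.__add__`) instead of `append`, because `append` modifies the list in place and returns None. There is also no need to add to the list in the second return: the recursive call already returns the complete list, so adding it again duplicates elements.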
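
The difference between the two operations can be seen in isolation. This is a minimal sketch with made-up IDs (`NCT001` and so on are placeholders, not real studies):

```python
# list.append mutates the list in place and returns None,
# so returning its result hands None back to the caller.
ids = ["NCT001"]
appended = ids.append("NCT002")

# The + operator (implemented by list.__add__) builds and returns a new list.
combined = ids + ["NCT003"]

print(appended)   # None
print(ids)        # ['NCT001', 'NCT002']
print(combined)   # ['NCT001', 'NCT002', 'NCT003']
```

This is why `return(nct_ids.append(...))` returned None even though `nct_ids` itself held all the studies.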

import requests
from xml.etree import ElementTree as ET

def collect_ct_ids(nct_ids=None, start=1):
    # use None rather than a mutable default list
    if nct_ids is None:
        nct_ids = []

    url = f'https://www.clinicaltrials.gov/ct2/results?&type=Intr&intr=%22NICOTINE+BITARTRATE%22%20OR%20%22PROSTEP%22%20OR%20%22NICOTROL%22%20OR%20%22NICOTINE+POLACRILEX%22%20OR%20%22NICORETTE%22%20OR%20%22NICODERM+CQ%22%20OR%20%22HABITROL%22%20OR%20%22NICOTINE%22&down_fmt=csv&down_flds=all?displayxml=true&start={start}&count=1000'
    print(url)

    # read xml
    response = requests.get(url)
    root = ET.fromstring(response.content)

    for child in root.iter('search_results'):
        n_studies = int(child.attrib['count'])

    # store studies: + returns a new list, whereas append returns None
    nct_ids = nct_ids + [x.find('nct_id').text for x in root.findall('clinical_study')]

    print(n_studies, len(nct_ids))

    ## check that the default url allows to collect all the studies
    if n_studies == len(nct_ids):
        # return the studies
        print('if')
        return nct_ids

    else:
        # reiterate until every study has been collected;
        # the recursive call already returns the full list, so return it as-is
        print('else')
        start += 1000
        return collect_ct_ids(nct_ids, start)
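
Since recursion is not the only option, the same pagination can also be sketched as a loop. This is a rough sketch rather than the code from the question: `collect_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the real HTTP request so the example runs offline.

```python
def collect_all(fetch_page, page_size=1000):
    """Collect IDs page by page until the reported total is reached.

    fetch_page(start) is a hypothetical callable that returns
    (total_count, ids_on_this_page) for a 1-based start index.
    """
    ids = []
    start = 1
    while True:
        total, page = fetch_page(start)
        ids.extend(page)
        # Stop once everything is collected, or if a page comes back empty
        if len(ids) >= total or not page:
            return ids
        start += page_size


# Stand-in for the real request: 2500 fake IDs served 1000 at a time.
FAKE_IDS = [f"NCT{i:08d}" for i in range(1, 2501)]

def fake_fetch(start):
    return len(FAKE_IDS), FAKE_IDS[start - 1:start - 1 + 1000]

result = collect_all(fake_fetch)
print(len(result))  # 2500
```

In real use, `fetch_page` would wrap the `requests.get` call and the XML parsing shown above. The empty-page check guards against looping forever if the reported count and the returned studies ever disagree.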
Karl Knechtel
Despe1990
  • @blhsing: I don't get how my issue is related to the one suggested. In my case the object nct_ids is not empty as the print shows. But the return does not seem to see it. – Despe1990 Dec 04 '19 at 21:18
  • Please fix `return(nct_ids.append(collect_ct_ids(nct_ids, start)))` according to the answer in the linked question. – blhsing Dec 04 '19 at 21:25
  • Saw that indeed. Thanks – Despe1990 Dec 04 '19 at 21:27

0 Answers