I have a large XML file downloaded from PubMed Central, and I'm trying to extract all the PMIDs (around 3 million). I want to extract the elem.text (e.g., 34405992) for the element tag and attribute shown below. Can someone advise on how to get all the PMIDs using multiprocessing, since there are 3 million records? Thanks.

article-id | {'pub-id-type': 'pmid'} | 34405992

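import xml.etree.ElementTree as ET

pmid = []
iterator = ET.iterparse('my_file.xml', events=("start", "end"))
_, root = next(iterator)  # the first "start" event yields the root element
print('ROOT:', root.tag)
for event, elem in iterator:
    if event != "end":
        continue  # an element's text is only guaranteed complete at its "end" event
    print(elem.tag, '|', elem.attrib, '|', elem.text)
    elem.clear()
    root.clear()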


Output:

    ROOT: pmc-articleset
    restricted-by | {} | pmc
    processing-meta | {'base-tagset': 'archiving', 'mathml-version': '3.0', 'table-model': 'xhtml', 'tagset-family': 'jats'} | 
        
    journal-id | {'journal-id-type': 'nlm-journal-id'} | 2985134R
    journal-id | {'journal-id-type': 'pubmed-jr-id'} | 2913
    journal-id | {'journal-id-type': 'nlm-ta'} | Chem Rev
    journal-id | {'journal-id-type': 'iso-abbrev'} | Chem Rev
    journal-title | {} | Chemical reviews
    journal-title-group | {} | 
            
    issn | {'pub-type': 'ppub'} | 0009-2665
    issn | {'pub-type': 'epub'} | 1520-6890
    journal-meta | {} | 
          
    article-id | {'pub-id-type': 'pmid'} | 34405992
    article-id | {'pub-id-type': 'pmc'} | 9148388
    article-id | {'pub-id-type': 'doi'} | 10.1021/acs.chemrev.1c00308
    article-id | {'pub-id-type': 'manuscript'} | NIHMS1808182
Mathew
  • 61
  • 1
  • 8

1 Answer

I was able to figure it out, though I couldn't make use of multiprocessing.

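import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup  # the 'xml' parser below also needs lxml installed

data = []

for event, elem in ET.iterparse('my_file.xml'):
    if elem.tag == "article-id":
        # Serialize the element and re-parse it so BeautifulSoup can filter on the attribute
        contents = ET.tostring(elem)
        soup = BeautifulSoup(contents, 'xml')
        input_tag = soup.find_all(attrs={'pub-id-type': 'pmid'})
        for i in input_tag:
            data.append(i.contents[0])
    elem.clear()  # drop the element's content once it has been handled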
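
The same filter can probably be done without the BeautifulSoup round trip, since the element handed back by iterparse already exposes its attributes. Below is a minimal sketch using only the standard library. The helper name iter_pmids is made up for illustration, and the small in-memory sample (including its front/article-meta nesting) is an assumption built from the IDs shown in the question, not the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_pmids(source):
    """Yield the text of every <article-id pub-id-type="pmid"> element.

    `source` may be a filename or a binary file object.
    (iter_pmids is a hypothetical helper, not part of any library.)
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the "start" of the root element
    for event, elem in context:
        if event != "end":
            continue  # text is only guaranteed complete at the "end" event
        if elem.tag == "article-id" and elem.get("pub-id-type") == "pmid":
            yield elem.text
        # Same clean-up as in the question: drop processed content so the
        # tree does not keep growing while the file is streamed.
        elem.clear()
        root.clear()


# Made-up sample for illustration; the real file would be passed by name,
# e.g. iter_pmids('my_file.xml').
sample = b"""<pmc-articleset><article><front><article-meta>
<article-id pub-id-type="pmid">34405992</article-id>
<article-id pub-id-type="pmc">9148388</article-id>
</article-meta></front></article></pmc-articleset>"""

pmids = list(iter_pmids(io.BytesIO(sample)))
print(pmids)
```

Because iterparse reads a single file sequentially, multiprocessing would most likely only pay off if the data were split into several files, each handled by its own worker.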