I have a large XML file downloaded from PubMed Central, and I'm trying to extract all the PMIDs (around 3 million). I want the elem.text (e.g., 34405992) for the element tag and attribute shown below. Can someone advise on how to get all the PMIDs using multiprocessing, since there are 3 million records? Thanks.
article-id | {'pub-id-type': 'pmid'} | 34405992
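import xml.etree.ElementTree as ET

pmid = []
# Request "start" events too, so the first event delivers the root element
iterator = ET.iterparse('my_file.xml', events=("start", "end"))
_, root = next(iterator)
print('ROOT:', root.tag)
for event, elem in iterator:
    if event == "end":  # elem.text is only complete at the "end" event
        print(elem.tag, '|', elem.attrib, '|', elem.text)
        elem.clear()  # free the element once it has been printed
root.clear()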
Output:
ROOT: pmc-articleset
restricted-by | {} | pmc
processing-meta | {'base-tagset': 'archiving', 'mathml-version': '3.0', 'table-model': 'xhtml', 'tagset-family': 'jats'} |
journal-id | {'journal-id-type': 'nlm-journal-id'} | 2985134R
journal-id | {'journal-id-type': 'pubmed-jr-id'} | 2913
journal-id | {'journal-id-type': 'nlm-ta'} | Chem Rev
journal-id | {'journal-id-type': 'iso-abbrev'} | Chem Rev
journal-title | {} | Chemical reviews
journal-title-group | {} |
issn | {'pub-type': 'ppub'} | 0009-2665
issn | {'pub-type': 'epub'} | 1520-6890
journal-meta | {} |
article-id | {'pub-id-type': 'pmid'} | 34405992
article-id | {'pub-id-type': 'pmc'} | 9148388
article-id | {'pub-id-type': 'doi'} | 10.1021/acs.chemrev.1c00308
article-id | {'pub-id-type': 'manuscript'} | NIHMS1808182
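
Based on the output above, the filtering step could be sketched roughly as follows. This is a minimal, single-process sketch and not a tested solution for the full file: the helper name `iter_pmids` and the in-memory sample are hypothetical, and it assumes the `article-id` elements carry no XML namespace, as in the output shown. It does not use multiprocessing, because a single XML stream has to be parsed sequentially; splitting the work across processes would require the data to be in several files, which is not covered here.

```python
import io
import xml.etree.ElementTree as ET


def iter_pmids(source):
    """Yield the text of every <article-id pub-id-type="pmid"> element.

    `source` may be a filename or a binary file object.
    (Hypothetical helper, not part of any library.)
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    depth = 0                # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        # "end" event: the element and its text are complete here
        if (elem.tag == "article-id"
                and elem.get("pub-id-type") == "pmid"
                and elem.text):
            yield elem.text.strip()
        depth -= 1
        if depth == 0:
            # A direct child of the root just closed: drop everything
            # collected so far so memory use stays roughly constant.
            root.clear()


# Tiny in-memory sample (made up for illustration, mirroring the output above)
sample = (
    b"<pmc-articleset><article><front><article-meta>"
    b"<article-id pub-id-type='pmid'>34405992</article-id>"
    b"<article-id pub-id-type='pmc'>9148388</article-id>"
    b"</article-meta></front></article></pmc-articleset>"
)
pmid = list(iter_pmids(io.BytesIO(sample)))
print(pmid)
```

For the real file, the call would presumably be `pmid = list(iter_pmids('my_file.xml'))`. Clearing the root after each top-level record is a design choice to keep memory bounded across 3 million records; it tracks nesting depth so that no particular record tag name has to be assumed.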