
I downloaded a zip file from https://clinicaltrials.gov/AllPublicXML.zip, which contains over 200k XML files (most under 10 KB each), to a directory (see 'dirpath_zip' in the CODE below) on Ubuntu 16.04 (a DigitalOcean droplet). What I'm trying to accomplish is loading all of these files into MongoDB, which is installed on the same machine as the zip file.

I ran the CODE below twice, and both runs consistently failed while processing the 15988th file.

I've googled around and read other posts regarding this error, but couldn't find a way to solve the issue. Actually, I'm not really sure what the problem really is... any help is much appreciated!!

CODE:

import re
import sys
import json
import zipfile
import pymongo
import datetime
import xmltodict
from bs4 import BeautifulSoup
from pprint import pprint as ppt


def timestamper(stamp_type="regular"):
    if stamp_type == "regular":
        timestamp = str(datetime.datetime.now())
    elif stamp_type == "filename":
        timestamp = str(datetime.datetime.now()).replace("-", "").replace(":", "").replace(" ", "_")[:15]
    else:
        sys.exit("ERROR [timestamper()]: unexpected 'stamp_type' (parameter) encountered")
    return timestamp


client = pymongo.MongoClient()
db = client['ctgov']
coll_name = "ts_"+timestamper(stamp_type="filename")
coll = db[coll_name]

dirpath_zip = '/glbdat/ctgov/all/alltrials_20180402.zip'
z = zipfile.ZipFile(dirpath_zip, 'r')
i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        soup = BeautifulSoup(z.read(xmlfile), 'lxml')

        # parse the <clinical_study> element to a dict, collapse whitespace,
        # and round-trip through JSON to get a MongoDB-ready document
        json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())

        coll.insert_one(json_study)

        i += 1

ERROR MESSAGE:

Traceback (most recent call last):
  File "zip_to_mongo_alltrials.py", line 38, in <module>
    soup = BeautifulSoup(z.read(xmlfile), 'lxml')
  File "/usr/local/lib/python3.5/dist-packages/bs4/__init__.py", line 225, in __init__
    markup, from_encoding, exclude_encodings=exclude_encodings)):
  File "/usr/local/lib/python3.5/dist-packages/bs4/builder/_lxml.py", line 118, in prepare_markup
    for encoding in detector.encodings:
  File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 264, in encodings
    self.chardet_encoding = chardet_dammit(self.markup)
  File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 34, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 30, in detect
    u.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 128, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "/usr/lib/python3/dist-packages/chardet/charsetgroupprober.py", line 64, in feed
    st = prober.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/hebrewprober.py", line 224, in feed
    aBuf = self.filter_high_bit_only(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/charsetprober.py", line 53, in filter_high_bit_only
    aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
  File "/usr/lib/python3.5/re.py", line 182, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
  • You should try to `del` your variables after use and add some manual garbage collection. Also try manually decompressing the archive beforehand and processing that problematic file on its own. – Klaus D. Apr 03 '18 at 06:59
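
A minimal sketch of that isolation idea, assuming the failing entry sits near index 15988 of `z.namelist()` (the loop counter skips `Contents.txt`, so the exact index may be off by one):

    # Hypothetical isolation test: pull out the entry where the loop dies
    # and check whether that one file is unusually large.
    import zipfile

    z = zipfile.ZipFile('/glbdat/ctgov/all/alltrials_20180402.zip', 'r')
    name = z.namelist()[15988]   # index is an assumption, see above
    data = z.read(name)
    print(name, len(data), 'bytes')
    with open('/tmp/problem_file.xml', 'wb') as f:
        f.write(data)            # extracted copy for standalone parsing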

1 Answer


Try moving the reading from the zip file and the insert into the database into a separate method, so its locals can be freed after each file. Also call gc.collect() to force garbage collection.

    import gc

    def read_xml_insert(xmlfile):
        # doing the parsing inside a function lets its locals be
        # released as soon as each file has been inserted
        soup = BeautifulSoup(z.read(xmlfile), 'lxml')
        json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
        coll.insert_one(json_study)

    i = 0
    for xmlfile in z.namelist():
        print(i, 'parsing:', xmlfile)
        if xmlfile == 'Contents.txt':
            print(xmlfile, '==> entering "continue"')
            continue
        else:
            read_xml_insert(xmlfile)
            i += 1
        gc.collect()  # force a collection pass between files

Please see if this helps.
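
Side note: the traceback ends inside chardet, which BeautifulSoup invokes to guess the encoding of the raw bytes, so the detection step is where memory blows up. If the files are known to be UTF-8 (an assumption on my part; each ClinicalTrials.gov XML file declares its encoding in the XML header), you can decode the bytes yourself and hand BeautifulSoup a str, which skips chardet entirely. A sketch of that variant:

    # Sketch: bypass chardet (where the MemoryError occurs) by decoding
    # the bytes up front. Assumes the XML files are UTF-8.
    def read_xml_insert(xmlfile):
        text = z.read(xmlfile).decode('utf-8')   # explicit decode, no byte-sniffing
        soup = BeautifulSoup(text, 'lxml')
        json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
        coll.insert_one(json_study)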
