I have a collection of 286,484 documents, `collection`, each of which contains many fields; in particular, they all contain a `title` and a `pmid`. I want a dictionary that maps pmids to titles.
I expected this to be approximately instantaneous given the moderate amount of data. Instead, the code below reports a runtime of approximately 330.1 sec.
```python
import time

start = time.perf_counter()
papers = collection.find(projection={"_id": False, "title": True, "pmid": True})
papers2 = {paper["pmid"]: paper["title"] for paper in papers}
stop = time.perf_counter()
print(f"elapsed time: {stop - start} sec")
```
Why does this take so long, and how do I speed it up?
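For what it's worth, a self-contained sketch with synthetic data (made-up pmids and titles, no MongoDB involved) suggests the dict comprehension itself is cheap at this scale, so the time presumably goes to fetching and decoding the documents:

```python
import time

# Synthetic stand-ins for the 286,484 projected documents
# (pmids and titles here are made up, purely for scale).
docs = [{"pmid": i, "title": f"title {i}"} for i in range(286484)]

start = time.perf_counter()
papers2 = {d["pmid"]: d["title"] for d in docs}
stop = time.perf_counter()

print(f"dict build alone: {stop - start:.3f} sec")
print(len(papers2))
```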
Other relevant facts:
- I'm running Python 3.7.6 on Linux with pymongo 3.12.1 and MongoDB 4.4.0.
- I've verified that the projection works correctly (i.e. it returns `pmid` and `title` and nothing else).
- This is all running on a single cloud machine (i.e. database and code on the same machine, no sharding). It's not particularly high-powered, but there's free memory and no other simultaneous users.
- `pmid` is indexed; `title` is not.
- `explain` doesn't really help here because there's no filter. The `winningPlan` is `PROJECTION_SIMPLE` and there are no `rejectedPlans`.
- A possible clue: calling `explain` on the cursor took 440 sec.