Update: If you read this to improve your insert/update speeds, check whether you have the pymongo C extensions enabled on your system by running pymongo.has_c() from a Python console first. If it resolves to False, you either need to compile pymongo with the C extensions or do pip install --upgrade pymongo.
It improved my workflow from 17 seconds on 10K rows to about 0.57 seconds.
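For reference, the check is a one-liner from an interactive session:

import pymongo

# True means the C extensions are compiled in; False means the pure-Python fallback is used
print(pymongo.has_c())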
I have thousands of txt files, containing millions of rows of data that I am trying to import into a mongodb collection.
I'm currently using the following def:
import re, pymongo

coll = pymongo.MongoClient().database.collection
rx = re.compile(r'[:; ]')
rx_email = re.compile(r'\S+@\S+\.\S+$')

def parser(path):
    with open(path, "rb") as f:
        for line in f:
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    coll.find_one_and_update({'email': email}, {'$addToSet': {'passwords': password}}, upsert=True)
                elif username:
                    coll.find_one_and_update({'username': username}, {'$addToSet': {'passwords': password}}, upsert=True)
                else:
                    pass
            except UnicodeDecodeError:
                pass

if __name__ == "__main__":
    parser('path/to/file.txt')
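For the full run, the same parser just needs to be called once per file; a hypothetical driver (the directory name is made up) could look like:

from pathlib import Path

# Hypothetical driver: run the parser over every .txt file in a dump directory
for path in Path('path/to/dumps').glob('*.txt'):
    parser(str(path))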
When I run the script on a file with 10K rows, it takes about 74.6 seconds. I assume this is due to the number of existing documents MongoDB has to match against for every upsert? Running the same loop without the database interaction takes about 0.023 seconds.
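An index on the fields used in the upsert filter should help that matching step; a minimal sketch, assuming the same coll handle as above and that each document only ever has one of the two fields, would be:

# Sparse unique indexes on the lookup fields, so the upsert filter can use an index
# instead of scanning the collection (sparse because a document has either an email
# or a username, not both).
coll.create_index('email', unique=True, sparse=True)
coll.create_index('username', unique=True, sparse=True)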
EDIT: As suggested in Fast or Bulk Upsert in pymongo, I have also tried to use UpdateOne with bulk_write as follows:
from pymongo import UpdateOne

def parser(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                LOCAL_DB.bulk_write(ops, ordered=False)
                ops = []
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(UpdateOne({'identifier': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(UpdateOne({'identifier': username}, {'$addToSet': {'passwords': password}}, upsert=True))
                else:
                    pass
            except UnicodeDecodeError:
                pass
    # flush the last partial batch
    if ops:
        LOCAL_DB.bulk_write(ops, ordered=False)
Time to complete 10K lines is 17 seconds, which is still far too slow for the number of files and rows I am trying to import.
Are there any better (and hopefully faster) ways of doing this?
Some requirements:
- email and/or username should be unique.
- The array containing passwords should only list each password once (also unique).
- 1M rows should (if possible) take less than 1 minute to insert.
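To make those requirements concrete, this is roughly the document shape I am aiming for (field names taken from the bulk_write version above; the values are made-up examples):

# Hypothetical example of a target document (example values only)
doc = {
    'identifier': 'user@example.com',      # email or username, unique per document
    'passwords': ['hunter2', 'p455w0rd'],  # each password listed only once ($addToSet)
}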