
Update: If you're reading this to improve your insert/update speeds, first check whether the pymongo C extensions are enabled on your system by running `pymongo.has_c()` from a Python console. If it returns `False`, you either need to compile pymongo with the C extensions or run pip install --upgrade pymongo.

That alone took my workflow from 17 seconds for 10K rows down to about 0.57 seconds.
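The check itself is a one-liner, for example:

import pymongo

# False means pymongo fell back to its pure-Python BSON implementation,
# which is much slower at encoding/decoding documents.
print(pymongo.has_c())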

I have thousands of txt files containing millions of rows of data that I am trying to import into a MongoDB collection.

I'm currently using the following function:

import re, pymongo

coll = pymongo.MongoClient().database.collection
rx = re.compile(r'[:; ]')                # field separators
rx_email = re.compile(r'\S+@\S+\.\S+$')  # crude email pattern

def parser(path):
    with open(path, "rb") as f:
        for line in f:
            try:
                # rstrip() so the trailing newline doesn't end up in the password
                fields = rx.split(line.decode('utf-8').rstrip())
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    coll.find_one_and_update({'email': email}, {'$addToSet': {'passwords': password}}, upsert=True)
                elif username:
                    coll.find_one_and_update({'username': username}, {'$addToSet': {'passwords': password}}, upsert=True)
            except UnicodeDecodeError:
                pass  # skip lines that aren't valid UTF-8

if __name__ == "__main__":
    parser('path/to/file.txt')

When I run the script on a file with 10K rows, it takes about 74.6 seconds. I assume this is due to the number of documents MongoDB has to match against on each upsert? Running the same loop without the db calls takes about 0.023 seconds.
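For reference, those timings come from a simple wrapper along these lines (a sketch, assuming the `parser` function above; the path is a placeholder):

import time

start = time.perf_counter()
parser('path/to/file.txt')  # the 10K-row sample file
print(f'elapsed: {time.perf_counter() - start:.3f} s')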

EDIT: As suggested in [Fast or Bulk Upsert in pymongo](https://stackoverflow.com/questions/5292370/fast-or-bulk-upsert-in-pymongo), I have also tried using UpdateOne with bulk_write, as follows:

from pymongo import UpdateOne

# rx, rx_email, and coll are the same as in the first snippet above.

def parser(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                # Send the queued upserts in one round trip.
                coll.bulk_write(ops, ordered=False)
                ops = []
            try:
                fields = rx.split(line.decode('utf-8').rstrip())
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(UpdateOne({'identifier': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(UpdateOne({'identifier': username}, {'$addToSet': {'passwords': password}}, upsert=True))
            except UnicodeDecodeError:
                pass
    if ops:
        # Flush the final partial batch.
        coll.bulk_write(ops, ordered=False)

Time to complete 10K lines is now 17 seconds, which is still far too slow for the number of files and rows I am trying to import.

Are there any better (and hopefully faster) ways of doing this?

Some requirements:

  1. email and/or username should be unique.
  2. The array containing passwords should only list each password once (also unique); the `$addToSet` in my code is meant to handle this (see the sketch after this list).
  3. 1M rows should (if possible) take less than 1 minute to insert.
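To illustrate requirement 2, a minimal sketch with made-up values: `$addToSet` only appends a value when it isn't already in the array, so importing the same line twice leaves a single entry.

import pymongo

coll = pymongo.MongoClient().database.collection

# Upserting the same identifier/password pair twice still yields
# one array entry, because $addToSet skips values already present.
for _ in range(2):
    coll.update_one({'identifier': 'demo@example.com'},
                    {'$addToSet': {'passwords': 'hunter2'}},
                    upsert=True)

print(coll.find_one({'identifier': 'demo@example.com'})['passwords'])
# ['hunter2']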
  • Possible duplicate of [Fast or Bulk Upsert in pymongo](https://stackoverflow.com/questions/5292370/fast-or-bulk-upsert-in-pymongo) – JohnnyHK Feb 04 '19 at 16:36
  • I don't think that solution is what I am looking for. Or at least not in the form it is presented there. – Ole Aldric Feb 04 '19 at 17:11
  • Do you have a unique index on `identifier`? – JohnnyHK Feb 04 '19 at 17:36
  • Yes I do (ascending). If I try with larger files (say 1M rows), it just stops after 34200 inserts. – Ole Aldric Feb 04 '19 at 17:37
  • Then that's likely as fast as you can make it. Although it stopping after 34200 inserts may indicate a bug in your code somewhere. – JohnnyHK Feb 04 '19 at 17:44
  • I was hoping to find something around pre-encoding it to BSON before insertion could speed this up even further, but I haven't found much information other than "users shouldn't normally need to encode to BSON" – Ole Aldric Feb 04 '19 at 17:48
  • I found out that `pymongo.has_c()` resolved to `False` on my system; after doing `pip install --upgrade pymongo`, I got 10K lines down to 0.57 seconds (also making sure the C extensions were `True` afterwards). – Ole Aldric Feb 04 '19 at 18:51
  • That's great! Go ahead and add that as an answer (and accept it). Or add a new answer to the linked dupe. – JohnnyHK Feb 04 '19 at 19:01

1 Answer


It seems I managed, with some guidance from @JohnnyHK in the comments, to get my initial upsert time down from ~74 to ~0.5 seconds for 10K rows by making the following changes to my initial code:

import re, pymongo

coll = pymongo.MongoClient().database.collection  # create the client once, not per batch
rx = re.compile(r'[:; ]')
rx_email = re.compile(r'\S+@\S+\.\S+$')

def parse(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                coll.bulk_write(ops, ordered=False)
                ops = []
            try:
                fields = rx.split(line.decode('utf-8').rstrip())
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(pymongo.UpdateOne({'_id': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(pymongo.UpdateOne({'_id': username}, {'$addToSet': {'passwords': password}}, upsert=True))
                else:
                    pass # logic removed
            except UnicodeDecodeError:
                pass # logic removed
    if ops:
        coll.bulk_write(ops, ordered=False)  # flush the final partial batch

if __name__ == "__main__":
    parse('path/to/file.txt')

I found that the pymongo C extensions were missing on my system:

>>> import pymongo
>>> pymongo.has_c()
False

From there I did `pip install --upgrade pymongo` (luckily for me, that was enough), and `pymongo.has_c()` then resolved to `True`.

I also used `_id` instead of `identifier` as the unique field, which improved speed even more.
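That makes sense, since every collection already has a built-in unique index on `_id`, so keying the upserts on it avoids maintaining a second index. A minimal sketch of the difference (example values, collection name as above):

import pymongo

coll = pymongo.MongoClient().database.collection

# Keying on a custom field requires creating and maintaining
# an extra unique index:
coll.create_index('identifier', unique=True)

# Keying on _id uses the index every collection gets for free:
coll.bulk_write(
    [pymongo.UpdateOne({'_id': 'user@example.com'},       # example value
                       {'$addToSet': {'passwords': 'hunter2'}},
                       upsert=True)],
    ordered=False,
)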

Hope this helps people down the road. I will update with more findings as I learn.
