
I have written a program to benchmark a MongoDB database under multithreaded bulk write conditions.

The problem is that the program hangs and does not finish executing.

I am quite sure the problem is that I am writing 530838 records to the database using 10 threads that bulk write 50 records at a time. That leaves a remainder of 38 records, but the run method always fetches 50 records from the queue, so the process hangs once 530800 records have been written and never writes the final 38 records, because the following loop never finishes executing:

for object in range(50): objects.append(self.queue.get())
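For reference, the arithmetic behind the remainder:

>>> divmod(530838, 50)
(10616, 38)

That is 10616 full batches of 50 (530800 records) with 38 records left over.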

I would like the program to write 50 records at a time until fewer than 50 remain, at which point it should write the remaining records in the queue, with each thread exiting once the queue is empty.

Thanks in advance :)

import threading
import Queue
import json
from pymongo import MongoClient, InsertOne
import datetime

#Set the number of threads
n_thread = 10
#Create the queue
queue = Queue.Queue()

#Connect to the database
client = MongoClient("mongodb://mydatabase.com")
db = client.threads

class ThreadClass(threading.Thread):

    def __init__(self, queue):
        threading.Thread.__init__(self)
        #Assign the queue this thread works from
        self.queue = queue

    def run(self):

        while True:
            objects = []

            #Get the next 50 objects from the queue
            for object in range(50):
                objects.append(self.queue.get())

            #Insert the queued objects into the database
            db.threads.insert_many(objects)

            #Signal to the queue that the job is done
            self.queue.task_done()


#Create the worker threads
threads = []

for i in range(n_thread):
    t = ThreadClass(queue)
    t.setDaemon(True)
    #Start thread
    t.start()

#Start timer
starttime = datetime.datetime.now()

#Load the JSON array of objects
content = json.load(open("data.txt","r"))
for jsonobj in content:
    #Put object into queue
    queue.put(jsonobj)
#Wait on the queue until everything has been processed
queue.join()

for t in threads:
    t.join()

#Print the total execution time
endtime = datetime.datetime.now()
duration = endtime-starttime
print(divmod(duration.days * 86400 + duration.seconds, 60))

1 Answer


From the docs on Queue.get you can see that the defaults are block=True and timeout=None, which means that a get on an empty queue blocks until an item becomes available.

You could use get_nowait or get(False) to ensure you're not blocking. If you want the blocking to be conditional on whether the queue has 50 items, whether it is empty, or other conditions, you can use Queue.empty and Queue.qsize, but note that they do not provide race-condition-proof guarantees of non-blocking behavior; they are merely heuristics for deciding whether to use block=False with get.

Something like this:

def run(self):

    while True:
        objects = []

        #Get up to the next 50 objects from the queue;
        #only block when at least 50 items appear to be available
        block = self.queue.qsize() >= 50
        for i in range(50):
            try:
                item = self.queue.get(block=block)
            except Queue.Empty:
                break
            objects.append(item)

        #Insert the queued objects into the database (skip empty
        #batches, since insert_many raises on an empty list)
        if objects:
            db.threads.insert_many(objects)

        #Signal task_done once per retrieved item so queue.join() can return
        for i in range(len(objects)):
            self.queue.task_done()

Another approach would be to set a timeout and use a try ... except block to catch any Empty exceptions that are raised. This has the advantage that you decide how long to wait, rather than heuristically guessing when to return immediately, but the two approaches are otherwise similar.
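A sketch of that timeout variant (the 0.1 second wait is an arbitrary choice, tune it to your workload):

def run(self):

    while True:
        objects = []

        #Wait up to 0.1s for each of the next 50 objects
        for i in range(50):
            try:
                item = self.queue.get(timeout=0.1)
            except Queue.Empty:
                break
            objects.append(item)

        #Insert whatever was collected, if anything
        if objects:
            db.threads.insert_many(objects)

        #Mark each retrieved item as processed
        for i in range(len(objects)):
            self.queue.task_done()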

Also note that I changed your loop variable from object to i; you should avoid having a loop variable shadow the built-in object class.
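For example, in an interactive session:

>>> for object in range(3):
...     pass
...
>>> isinstance(5, object)
Traceback (most recent call last):
  ...
TypeError: isinstance() arg 2 must be a class, type, or tuple of classes and types

After the loop, object is just the int 2, so anything that expects the real object class breaks.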
