Multithreading hangs: suspected error in condition variable usage

Question

I am testing Ant Colony Optimisation (ACO) software that runs multiple threads (one per ant).

Each ACO iteration should wait for all threads to finish before the next iteration is allowed to start. I am doing this with `Condition()` from the `threading` module.

Since the ants share a pheromone matrix, reads and writes on that matrix are protected by locks, also from the `threading` module.

Now a description of the problem:

I run the function and print something at each iteration. Sometimes, but not always, the execution seems to stop: nothing more is printed, which means the current iteration never finished.

I honestly don't know why this is happening, and I would appreciate any answer that could put me on the right track. If I had to guess, I would say that the condition variable is not being used properly. However, I am not sure, and I also find it odd that this only happens sometimes.

Below are the relevant functions. The ACO starts by calling the `start()` function. This creates N threads, each of which calls `update()` when it finishes. Once `update()` has been called N times, it calls `notify()`, which allows `start()` to continue and, finally, start the next iteration. I have also posted the `run()` method of each thread.

It may be worth mentioning that, without daemon actions, the error hardly ever occurs. With daemon actions, it occurs almost always (which I also find odd). Finally, the error does not always happen in the same iteration.

    def start(self):
        self.ants = self.create_ants()
        self.iter_counter = 0

        while self.iter_counter < self.num_iterations:
            print "START ACQUIRED"
            self.cv.acquire()
            print "calling iteration"
            self.iteration()
            #CV wait until all ants (threads) finish and call update, which
            #calls notify(), and allow continuation
            while not self.iter_done:
                print "iter not complete, W8ING"
                self.cv.wait()
            print "global update "
            self.global_update_with_lock()
            print "START RELEASED"
            self.cv.release()

    def update(self, ant):
        lock = Lock()
        lock.acquire()

        print "Update called by %s" % (ant.ID,)
        self.ant_counter += 1

        self.avg_path_cost += ant.path_cost

        # book-keeping
        if ant.path_cost < self.best_path_cost:
            self.best_path_cost = ant.path_cost
            self.best_path_mat = ant.path_mat
            self.best_path_vec = ant.path_vec
            self.last_best_path_iteration = self.iter_counter

        #all threads finished, call notify
        print "ant counter"
        print self.ant_counter
        if self.ant_counter == len(self.ants):
            print "ants finished"
            #THIS MIGHT CAUSE PROBLEMS (no need to notify if its no one waiting)
            self.best_cost_at_iter.append(self.best_path_cost)
            self.avg_path_cost /= len(self.ants)

            self.cv.acquire()
            self.iter_done = True
            self.cv.notify()
            self.cv.release()

        lock.release()

    # override Thread's run()
    def run(self):
        graph = self.colony.graph
        while not self.end():
            # we need exclusive access to the graph
            graph.lock.acquire()
            new_node = self.state_transition_rule(self.curr_node)
            self.path_cost += graph.delta(self.curr_node, new_node)

            self.path_vec.append(new_node)
            self.path_mat[self.curr_node][new_node] = 1  #adjacency matrix representing path

            #print "Ant %s : %s, %s" % (self.ID, self.path_vec, self.path_cost,)

            self.local_updating_rule(self.curr_node, new_node)
            graph.lock.release()

            self.curr_node = new_node

        # close the tour
        self.path_vec.append(self.path_vec[0])

        #RUN LOCAL HEURISTIC
        if self.daemon == True:
            try:
                daemon_result =  twoOpt(self.path_vec, graph.delta_mat)
                d_path, d_adj = daemon_result['path_vec'], daemon_result['path_matrix']
                self.path_vec = d_path
                self.path_mat = d_adj
            except Exception, e:
                print "exception: " + str(e)
                traceback.print_exc()

        self.path_cost += graph.delta(self.path_vec[-2], self.path_vec[-1])
        # send our results to the colony
        self.colony.update(self)
        #print "Ant thread %s terminating." % (self.ID,)
        
        # allows thread to be restarted (calls Thread.__init__)
        self.__init__(self.ID, self.start_node, self.colony, self.daemon, self.Beta, self.Q0, self.Rho)
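
One detail in `update()` above may be worth isolating: `lock = Lock()` is created inside the function, so every call gets its own brand-new lock. The sketch below (the helper name `acquire_fresh_lock` is hypothetical, and the code is written to run on Python 3 rather than the Python 2 used in the question) suggests why such a lock probably does not serialise callers, whereas a single lock shared by all callers does.

```python
import threading


def acquire_fresh_lock():
    # Mirrors `lock = Lock()` inside update(): a new lock on every call.
    lock = threading.Lock()
    acquired = lock.acquire(False)  # non-blocking acquire
    return lock, acquired


# Two callers each create their own lock, so both acquisitions succeed
# and neither caller excludes the other.
lock_a, got_a = acquire_fresh_lock()
lock_b, got_b = acquire_fresh_lock()
print("fresh locks acquired: %s, %s" % (got_a, got_b))

# One lock shared by every caller: while the first holder keeps it,
# a second non-blocking acquire fails.
shared = threading.Lock()
got_first = shared.acquire(False)
got_second = shared.acquire(False)
print("shared lock acquired: %s, %s" % (got_first, got_second))
shared.release()
```

If that reading is right, `self.ant_counter += 1` would not actually be protected, and two ants could observe the same counter value.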

Solution to the problem: First, I corrected how I wait on the condition variable, following the comments below. Second, it was still hanging sometimes, and this was due to a bug in the thread counter update. The fix was to change the counter from an int to a list of length num_threads, initialised to 0's, where each thread sets its own position to 1. When all threads have finished, the list should be all 1's. This is currently working just fine.
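
The per-thread flag idea described above could look roughly like the following. This is only a sketch, not the actual corrected code: `IterationTracker`, `mark_done`, `wait_all` and `run_iterations` are hypothetical names, the worker threads do nothing except report completion, and the code targets Python 3.

```python
import threading


class IterationTracker(object):
    """Hypothetical helper: one completion flag per worker thread."""

    def __init__(self, num_threads):
        self.cv = threading.Condition()
        self.done = [0] * num_threads

    def mark_done(self, index):
        # Each worker sets only its own slot, under the condition's lock.
        with self.cv:
            self.done[index] = 1
            if all(self.done):
                self.cv.notify_all()

    def wait_all(self):
        with self.cv:
            # Re-check the predicate in a loop; skip waiting if already done.
            while not all(self.done):
                self.cv.wait()
            # Reset the flags for the next iteration.
            self.done = [0] * len(self.done)


def run_iterations(num_threads, num_iterations):
    tracker = IterationTracker(num_threads)
    completed = []
    for iteration in range(num_iterations):
        workers = [threading.Thread(target=tracker.mark_done, args=(i,))
                   for i in range(num_threads)]
        for worker in workers:
            worker.start()
        tracker.wait_all()
        for worker in workers:
            worker.join()
        completed.append(iteration)
    return completed


print(len(run_iterations(8, 20)))
```

Because the flags are only touched while the condition's lock is held, a plain integer counter would presumably be just as safe in this arrangement; the list mainly makes it visible which thread has not reported yet.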

If you wait for something that has already happened, you're going to be waiting a *very* long time. **Never** call `wait` on a condition variable unless you have checked that the thing you are waiting for hasn't already happened. Also, you manipulate shared state before calling `acquire`. See [here](https://docs.python.org/2.0/lib/condition-objects.html) for proper use. Notice the `acquire` before the `wait` which is in a `while` loop. Notice the `acquire` before shared state is modified. - David Schwartz, Nov 23 '17 at 00:24
My sequence of actions is 1) self.iterations() (a function that simply starts #num_ants threads), 2) self.acquire(), 3) self.wait(). So if I am waiting for something that has already happened, does that mean that between 1) and 3) all threads have already terminated, and this is why no one notifies? @DavidSchwartz - Rafael Marques, Nov 23 '17 at 02:58
After correcting the code based on your comment and link, I still got the error. Further investigation led to this: I added a print to the update function that prints the current thread_counter. When the error occurs, update prints the same thread_counter twice, so the termination condition that calls notify is never reached. This is the part of the output where the error occurred: "ant counter 3, Update called by 6, ant counter 4, Update called by 19, Update called by 4, ant counter ant counter 5 5 Update called by 14 ant counter 6" - Rafael Marques, Nov 23 '17 at 04:10
Can you show us the corrected code? It's quite possible you didn't correct it properly. For example, you may still modify or access shared state without protection or call `wait` when you shouldn't. - David Schwartz, Nov 23 '17 at 07:45
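
For reference, the pattern the comments describe (modify shared state only while holding the condition's lock, and call `wait` inside a `while` loop that checks the predicate first) can be sketched as below. This is not the asker's corrected code: `Colony` here is a stripped-down, hypothetical stand-in in which the ant threads only call `update()`, and the code targets Python 3.

```python
import threading


class Colony(object):
    """Hypothetical stand-in; only the synchronisation is shown."""

    def __init__(self, num_ants):
        self.num_ants = num_ants
        self.cv = threading.Condition()
        self.ant_counter = 0
        self.iter_done = False

    def update(self):
        # All shared state is modified while holding the condition's lock.
        with self.cv:
            self.ant_counter += 1
            if self.ant_counter == self.num_ants:
                self.iter_done = True
                self.cv.notify()

    def iteration(self):
        with self.cv:
            self.ant_counter = 0
            self.iter_done = False
        ants = [threading.Thread(target=self.update)
                for _ in range(self.num_ants)]
        for ant in ants:
            ant.start()
        with self.cv:
            # If the ants already finished, the loop body never runs,
            # so there is no wait for a notification that was already sent.
            while not self.iter_done:
                self.cv.wait()
        for ant in ants:
            ant.join()
        return self.ant_counter


colony = Colony(10)
results = [colony.iteration() for _ in range(20)]
print(results == [10] * 20)
```

The point of the `while not self.iter_done` check is that it does not matter whether the last ant calls `notify()` before or after the main thread reaches the wait.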
