I am testing Ant Colony Optimisation (ACO) software that runs with multiple threads (one for each ant created).
Each ACO iteration should wait for all threads to finish before allowing the next iteration to start. I am doing this with a Condition object from the threading module.
Since the ants share a pheromone matrix, reads and writes on that matrix are protected by locks, also from the threading module.
Now a description of the problem:
I run the function and print something at each iteration. Sometimes, but not always, the execution of the function seems to stop: it stops printing, meaning the iteration never finishes.
I honestly don't know why this is happening, and I would appreciate any answer that could put me on the right track. If I had to guess, I would say that the condition variable is not being used properly, or something like that. However, I am not sure, and I also find it odd that this only happens sometimes.
Below are the relevant functions. The ACO starts by calling the start() function. This creates N threads, each of which calls update() when it finishes. Once update() has been called N times, it calls notify(), which allows start() to continue and, finally, start the next iteration. I have also posted the run() method of each thread.
It may be worth mentioning that without daemon actions the error hardly ever occurs. With daemon actions, it occurs almost always (which I also find odd). Finally, the error does not always happen in the same iteration.

    def start(self):
        self.ants = self.create_ants()
        self.iter_counter = 0
        while self.iter_counter < self.num_iterations:
            print "START ACQUIRED"
            self.cv.acquire()
            print "calling iteration"
            self.iteration()
            #CV wait until all ants (threads) finish and call update, which
            #calls notify(), and allow continuation
            while not self.iter_done:
                print "iter not complete, W8ING"
                self.cv.wait()
            print "global update "
            self.global_update_with_lock()
            print "START RELEASED"
            self.cv.release()

    def update(self, ant):
        lock = Lock()
        lock.acquire()
        print "Update called by %s" % (ant.ID,)
        self.ant_counter += 1
        self.avg_path_cost += ant.path_cost
        # book-keeping
        if ant.path_cost < self.best_path_cost:
            self.best_path_cost = ant.path_cost
            self.best_path_mat = ant.path_mat
            self.best_path_vec = ant.path_vec
            self.last_best_path_iteration = self.iter_counter
        #all threads finished, call notify
        print "ant counter"
        print self.ant_counter
        if self.ant_counter == len(self.ants):
            print "ants finished"
            #THIS MIGHT CAUSE PROBLEMS (no need to notify if its no one waiting)
            self.best_cost_at_iter.append(self.best_path_cost)
            self.avg_path_cost /= len(self.ants)
            self.cv.acquire()
            self.iter_done = True
            self.cv.notify()
            self.cv.release()
        lock.release()

    # override Thread's run()
    def run(self):
        graph = self.colony.graph
        while not self.end():
            # we need exclusive access to the graph
            graph.lock.acquire()
            new_node = self.state_transition_rule(self.curr_node)
            self.path_cost += graph.delta(self.curr_node, new_node)
            self.path_vec.append(new_node)
            self.path_mat[self.curr_node][new_node] = 1 #adjacency matrix representing path
            #print "Ant %s : %s, %s" % (self.ID, self.path_vec, self.path_cost,)
            self.local_updating_rule(self.curr_node, new_node)
            graph.lock.release()
            self.curr_node = new_node
        # close the tour
        self.path_vec.append(self.path_vec[0])
        #RUN LOCAL HEURISTIC
        if self.daemon == True:
            try:
                daemon_result = twoOpt(self.path_vec, graph.delta_mat)
                d_path, d_adj = daemon_result['path_vec'], daemon_result['path_matrix']
                self.path_vec = d_path
                self.path_mat = d_adj
            except Exception, e:
                print "exception: " + str(e)
                traceback.print_exc()
        self.path_cost += graph.delta(self.path_vec[-2], self.path_vec[-1])
        # send our results to the colony
        self.colony.update(self)
        #print "Ant thread %s terminating." % (self.ID,)
        # allows thread to be restarted (calls Thread.__init__)
        self.__init__(self.ID, self.start_node, self.colony, self.daemon, self.Beta, self.Q0, self.Rho)

Solution to the problem: First of all, I corrected the error in how the condition variable was waited on, following the comments here. Second, it was still hanging sometimes, and this was due to a bug in the thread counter update. The solution was to change the counter from an int to a list of length num_threads, initialised to 0s, where each thread updates its own position in the list. When all threads finish, the list should be all 1s. This is currently working just fine.
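
For anyone hitting the same thing, here is a minimal sketch of that idea: one flag per thread, combined with a condition variable that is waited on in a loop. The Colony class, run_iteration() and the done list are hypothetical names for illustration, not my actual code, and the worker threads here call update() directly instead of building a tour first. It should run on both Python 2 and Python 3.

```python
import threading


class Colony(object):
    """Hypothetical stand-in for the colony; all names are illustrative."""

    def __init__(self, num_ants):
        self.num_ants = num_ants
        # One Condition, created once and shared by every thread.
        self.cv = threading.Condition()
        # One slot per ant: 0 = still running, 1 = finished.
        self.done = [0] * num_ants

    def update(self, ant_id):
        # Called by each ant thread when it has finished its work.
        with self.cv:
            self.done[ant_id] = 1
            if all(self.done):
                # Wake the coordinating thread once every slot is set.
                self.cv.notify_all()

    def run_iteration(self):
        # Reset the flags before any worker of this iteration starts.
        with self.cv:
            self.done = [0] * self.num_ants
        workers = [threading.Thread(target=self.update, args=(i,))
                   for i in range(self.num_ants)]
        for w in workers:
            w.start()
        with self.cv:
            # Re-check the predicate in a loop; wait() releases the lock
            # while blocked and re-acquires it before returning.
            while not all(self.done):
                self.cv.wait()
        for w in workers:
            w.join()
        return list(self.done)


colony = Colony(8)
results = [colony.run_iteration() for _ in range(5)]
```

Every flag is written under the same Condition lock that the waiting thread holds while it checks the flags, so a notification cannot slip in between the check and the wait(). By contrast, a Lock created inside update() is private to that one call and does not exclude other threads, which may be why the integer counter was unreliable.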