6

I have a dictionary D of {string:list} entries, and I compute a function f( D[s1],D[s2] ) --> float for a pair of strings (s1,s2) in D.

Additionally, I have created a custom matrix class LabeledNumericMatrix that allows me to perform assignments such as m[ ID1, ID2 ] = 1.0 .

I need to calculate f(s1,s2) and store the result in m[s1,s2] for every pair of strings in the set S of keys, including the case s1=s2. This is easy to code as a loop, but execution takes quite some time as the size of the set S grows to large values such as 10,000 or more.

None of the results I store in my labeled matrix m depend on each other. Therefore, it seems straightforward to parallelize this computation using Python's threading or multiprocessing services. However, since CPython's GIL doesn't truly allow me to execute the calculation of f(x,y) and the storage of m[x,y] simultaneously through threading, it seems that multiprocessing is my only choice. That said, I don't think multiprocessing is designed to pass around 1 GB data structures between processes, such as my labelled matrix structure containing 10000x10000 elements.

Can anyone advise (a) whether I should avoid trying to parallelize this algorithm, and (b) if parallelization is feasible, how best to do it, preferably in CPython?

Michael J. Barber
J.B. Brown
  • Have you tried running your code under PyPy? I found it to be around 20 times faster than CPython on some occasions. It might be enough for you to avoid parallelization altogether. – zmbq Feb 20 '12 at 12:05

5 Answers

6

First option - a Server Process

Create a server process. It is part of the multiprocessing package and allows parallel access to data structures. This way every process accesses the data structure through the manager, which handles the locking between processes.

From the documentation:

Server process

A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.

A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array.
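A minimal sketch of this approach. The scoring function f, the tiny dictionary D and the way the pairs are split below are illustrative placeholders, not the question's actual code:

import multiprocessing

def f(v1, v2):
    # Placeholder scorer: substitute the real similarity function here.
    return float(sum(a * b for a, b in zip(v1, v2)))

def worker(pairs, sharedD, sharedResults):
    # Feature vectors are read from the managed dict; scores are written
    # back through the managed results dict (both live in the server process).
    for s1, s2 in pairs:
        sharedResults[(s1, s2)] = f(sharedD[s1], sharedD[s2])

if __name__ == "__main__":
    D = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
    manager = multiprocessing.Manager()
    sharedD = manager.dict(D)        # held by the manager's server process
    sharedResults = manager.dict()   # workers write their scores here
    keys = sorted(D)
    pairs = [(s1, s2) for i, s1 in enumerate(keys) for s2 in keys[i:]]
    half = len(pairs) // 2 + 1
    procs = [multiprocessing.Process(target=worker,
                                     args=(pairs[i:i + half], sharedD, sharedResults))
             for i in range(0, len(pairs), half)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(sharedResults.copy())

Every read and write goes through a proxy to the server process, so this is convenient but adds inter-process communication overhead per access.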

Second option - Pool of workers

Create a Pool of workers, an input Queue and a result Queue (sketched after the list below).

  • The main process, acting as a producer, will feed the input queue with pairs (s1, s2).
  • Each worker process will read a pair from the input Queue, and write the result into the output Queue.
  • The main thread will read the results from the result Queue, and write them into the result dictionary.
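A rough sketch of this pattern, again with a placeholder scorer f and toy data D chosen here only for illustration:

import multiprocessing

def f(v1, v2):
    # Placeholder scorer; replace with the real f( D[s1], D[s2] ).
    return float(sum(a * b for a, b in zip(v1, v2)))

def worker(D, inQueue, outQueue):
    # Consume pairs until a None sentinel arrives, pushing (pair, score)
    # onto the result queue.
    for s1, s2 in iter(inQueue.get, None):
        outQueue.put(((s1, s2), f(D[s1], D[s2])))

if __name__ == "__main__":
    D = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
    keys = sorted(D)
    pairs = [(s1, s2) for i, s1 in enumerate(keys) for s2 in keys[i:]]
    inQueue, outQueue = multiprocessing.Queue(), multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(D, inQueue, outQueue))
               for _ in range(2)]
    for w in workers:
        w.start()
    for pair in pairs:            # main process acts as the producer
        inQueue.put(pair)
    for _ in workers:             # one sentinel per worker to shut them down
        inQueue.put(None)
    results = dict(outQueue.get() for _ in pairs)  # drain before joining
    for w in workers:
        w.join()
    print(results)

Draining the result queue before joining the workers matters: if the queue's underlying pipe fills up, the program can hang, an issue that is also discussed in one of the answers below.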

Third option - divide into independent problems

Your data is independent: f( D[si], D[sj] ) is an isolated problem, independent of any f( D[sk], D[sl] ). Furthermore, the computation time of each pair should be fairly equal, or at least of the same order of magnitude.

Divide the task into n input sets, where n is the number of computation units (cores, or even computers) you have. Give each input set to a different process, and join the outputs.
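As a sketch, using the same placeholder f and toy D as above, the split-and-join could look roughly like this with multiprocessing.Pool:

import multiprocessing

D = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}  # toy stand-in for the real data

def f(v1, v2):
    # Placeholder scorer.
    return float(sum(a * b for a, b in zip(v1, v2)))

def scoreChunk(pairs):
    # Each worker handles one independent chunk and returns its scores.
    return [((s1, s2), f(D[s1], D[s2])) for s1, s2 in pairs]

if __name__ == "__main__":
    keys = sorted(D)
    pairs = [(s1, s2) for i, s1 in enumerate(keys) for s2 in keys[i:]]
    n = multiprocessing.cpu_count()
    chunks = [pairs[i::n] for i in range(n)]    # round-robin split into n chunks
    pool = multiprocessing.Pool(processes=n)
    results = {}
    for chunk in pool.map(scoreChunk, chunks):  # join the per-chunk outputs
        results.update(chunk)
    pool.close()
    pool.join()
    print(results)

Pool.map already handles the joining of the per-chunk outputs; the round-robin split keeps the chunks roughly equal in size.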

Adam Matan
2

You definitely won't get any performance increase with threading - it is an inappropriate tool for CPU-bound tasks.

So the only possible choice is multiprocessing, but since you have a big data structure, I'd suggest something like mmap (pretty low-level, but built-in) or Redis (tasty, high-level API, but it has to be installed and configured).
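As an illustration of the mmap-style route (not the answerer's code), a file-backed numpy.memmap can hold the big result matrix so each process writes its scores in place instead of shipping a 1 GB object between processes; the dimension, file name and worker count below are made up:

import multiprocessing
import numpy

N = 1000                 # matrix dimension; the question's case is ~10000
PATH = "scores.dat"      # hypothetical backing file for the shared matrix

def worker(rows):
    # Each process re-opens the same memory-mapped file and fills only
    # its own rows, so no explicit locking is needed.
    m = numpy.memmap(PATH, dtype=numpy.float64, mode="r+", shape=(N, N))
    for i in rows:
        m[i, :] = i      # stand-in for the real scores f(x, y)
    m.flush()

if __name__ == "__main__":
    # Create and zero-fill the backing file once in the parent process.
    numpy.memmap(PATH, dtype=numpy.float64, mode="w+", shape=(N, N)).flush()
    procs = [multiprocessing.Process(target=worker, args=(range(i, N, 4),))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    result = numpy.memmap(PATH, dtype=numpy.float64, mode="r", shape=(N, N))
    print(result[3, 0])  # row 3 was filled by one of the workers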

hochl
Roman Bodnarchuk
1

Have you profiled your code? Is it just calculating f that is too expensive, or storing the results in the data structure (or maybe both)?

If f is dominant, then you should make sure you can't make algorithmic improvements before you start worrying about parallelization. You might be able to get a big speed-up by turning some or all of the function into a C extension, perhaps using Cython. If you do go with multiprocessing, I don't see why you would need to pass the entire data structure between processes.

If storing results in the matrix is too expensive, you might speed up your code by using a more efficient data structure (like array.array or numpy.ndarray). Unless you have been very careful designing and implementing your custom matrix class, it will almost certainly be slower than those.
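For illustration, a plain numpy array plus an ID-to-index mapping gives labelled assignment with very little overhead (the IDs below are placeholders):

import numpy

ids = ["s1", "s2", "s3"]
index = dict((ID, i) for i, ID in enumerate(ids))   # label -> row/column number
m = numpy.zeros((len(ids), len(ids)), dtype=numpy.float64)

# Storing a score for a labelled pair is then just an indexed assignment.
m[index["s1"], index["s3"]] = 1.0
print(m)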

James
  • The matrix class I created is just a small wrapper around numpy's matrix storage class, providing an internal mapping from IDs to row/column numbers, and some other functionality associated with labelled columns and rows. I have done some profiling, but not comprehensively enough to provide a quality report. – J.B. Brown Mar 21 '12 at 15:37
0

Thank you everyone for your responses.

I have created a solution (not "the" solution) to the proposed problem, and since others may find it useful, I am posting the code here. My solution is a hybrid of options 1 and 3 suggested by Adam Matan. The code contains line numbers from my vi session, which will help in the discussion below.

 12 # System libraries needed by this module.
 13 import numpy, multiprocessing, time
 14 
 15 # Third-party libraries needed by this module.
 16 import labeledMatrix
 17 
 18 # ----- Begin code for this module. -----
 19 from commonFunctions import debugMessage
 20 
 21 def createSimilarityMatrix( fvFileHandle, fvFileParser, fvSimScorer, colIDs, rowIDs=None,
 22                             exceptionType=ValueError, useNumType=numpy.float, verbose=False,
 23                             maxProcesses=None, processCheckTime=1.0 ):
 24  """Create a labeled similarity matrix from vectorial data in [fvFileHandle] that can be
 25  parsed by [fvFileParser].
 26  [fvSimScorer] should be a function that can return a floating point value for a pair of vectors.
 27 
 28  If the matrix [rowIDs] are not specified, they will be the same as the [colIDs].
 29 
 30  [exceptionType] will be raised when a row or column ID cannot be found in the vectorial data.
 31  [maxProcesses] specifies the number of CPUs to use for calculation; default value is all available CPUs.
 32  [processCheckTime] is the interval for checking activity of CPUs (if completed calculation or not).
 33 
 34  Return: a LabeledNumericMatrix with corresponding row and column IDs."""
 35 
 36  # Setup row/col ID information.
 37  useColIDs = list( colIDs )
 38  useRowIDs = rowIDs or useColIDs
 39  featureData = fvFileParser( fvFileHandle, retainIDs=(useColIDs+useRowIDs) )
 40  verbose and debugMessage( "Retrieved %i feature vectors from FV file." % len(featureData) )
 41  featureIDs = featureData.keys()
 42  absentIDs = [ ID for ID in set(useColIDs + useRowIDs) if ID not in featureIDs ]
 43  if absentIDs: 
 44   raise exceptionType, "IDs %s not found in feature vector file." % absentIDs
 45  # Otherwise, proceed to creation of matrix.
 46  resultMatrix = labeledMatrix.LabeledNumericMatrix( useRowIDs, useColIDs, numType=useNumType )
 47  calculateSymmetric = True if set( useRowIDs ) == set( useColIDs ) else False
 48  
 49  # Setup data structures needed for parallelization.
 50  numSubprocesses = multiprocessing.cpu_count() if maxProcesses==None else int(maxProcesses)
 51  assert numSubprocesses >= 1, "Specification of %i CPUs to calculate similarity matrix." % numSubprocesses
 52  dataManager = multiprocessing.Manager()
 53  sharedFeatureData = dataManager.dict( featureData )
 54  resultQueue = multiprocessing.Queue() 
 55  # Assign jobs evenly through number of processors available.
 56  jobList = [ list() for i in range(numSubprocesses) ]
 57  calculationNumber = 0 # Will hold total number of results stored.
 58  if calculateSymmetric: # Perform calculations with n(n+1)/2 pairs, instead of n^2 pairs.
 59   remainingIDs = list( useRowIDs )
 60   while remainingIDs:
 61    firstID = remainingIDs[0]
 62    for secondID in remainingIDs:
 63     jobList[ calculationNumber % numSubprocesses ].append( (firstID, secondID) )
 64     calculationNumber += 1
 65    remainingIDs.remove( firstID )
 66  else: # Straight processing one at a time.
 67   for rowID in useRowIDs:
 68    for colID in useColIDs:
 69     jobList[ calculationNumber % numSubprocesses ].append( (rowID, colID) )
 70     calculationNumber += 1
 71     
 72  verbose and debugMessage( "Completed setup of job distribution: %s." % [len(js) for js in jobList] )
 73  # Define a function to perform calculation and store results
 74  def runJobs( scoreFunc, pairs, featureData, resultQueue ):
 75   for pair in pairs:
 76    score = scoreFunc( featureData[pair[0]], featureData[pair[1]] )
 77    resultQueue.put( ( pair, score ) )
 78   verbose and debugMessage( "%s: completed all calculations." % multiprocessing.current_process().name )
 79   
 80   
 81  # Create processes to perform parallelized computing.
 82  processes = list()
 83  for num in range(numSubprocesses):
 84   processes.append( multiprocessing.Process( target=runJobs,
 85                                              args=( fvSimScorer, jobList[num], sharedFeatureData, resultQueue ) ) )
 86  # Launch processes and wait for them to all complete.
 87  import Queue # For Queue.Empty exception.
 88  for p in processes:
 89   p.start()
 90  assignmentsCompleted = 0
 91  while assignmentsCompleted < calculationNumber:
 92   numActive = [ p.is_alive() for p in processes ].count( True )
 93   verbose and debugMessage( "%i/%i complete; Active processes: %i" % \
 94               ( assignmentsCompleted, calculationNumber, numActive ) )
 95   while True: # Empty queue immediately to avoid underlying pipe/socket implementation from hanging.
 96    try: 
 97     pair, score = resultQueue.get( block=False )
 98     resultMatrix[ pair[0], pair[1] ] = score
 99     assignmentsCompleted += 1
100     if calculateSymmetric:
101      resultMatrix[ pair[1], pair[0] ] = score
102    except Queue.Empty:
103     break 
104   if numActive == 0: finished = True
105   else:
106    time.sleep( processCheckTime )
107  # Result queue emptied and no active processes remaining - completed calculations.
108  return resultMatrix
109 ## end of createSimilarityMatrix()

Lines 36-47 are simply preliminary material related to the problem definition that was part of the original question. The setup for multiprocessing to get around CPython's GIL is in lines 49-56, with lines 57-70 used to divide the tasks evenly. Lines 57-70 are used instead of itertools.product because, when the list of row/column IDs reaches 40,000 or so, materializing the product ends up taking an enormous amount of memory.

The actual computation to be performed is in lines 74-78, and here the shared dictionary of ID->vector entries and shared result queue are utilized.

Lines 81-85 set up the actual Process objects, though they have not been started yet.

In my first attempt (not shown here), the "try ... resultQueue.get() and assign ... except ..." code was located outside of the outer control loop (the loop that runs while not all calculations are finished). When I ran that version on a unit test of a 9x9 matrix, there were no problems. Moving up to 200x200 or larger, however, that code hung, despite nothing changing between executions.

According to this discussion (http://bugs.python.org/issue8426) and the official documentation for multiprocessing, the use of multiprocessing.Queue can hang if the underlying pipe/socket implementation does not have a very large buffer. Therefore, the code given here as my solution periodically empties the queue while checking on completion of the processes (see lines 91-106), so that the child processes can keep putting new results into it and the pipe never becomes full.

When I tested the code on larger matrices of 1000x1000, I noticed that the computation code finished well ahead of the queue-draining and matrix-assignment code. Using cProfile, I found that one bottleneck was the default polling interval processCheckTime=1.0 (line 23), and lowering this value shortened the overall run time (see the bottom of the post for timing examples). This might be useful information for other people new to multiprocessing in Python.
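For reference, this kind of profiling can be done directly with the cProfile module; a minimal example, assuming the unit-test function at the bottom of this post lives in a module (the module name here is hypothetical):

import cProfile
from myMatrixModule import unitTest   # hypothetical module holding the code below

# Profile the whole matrix construction and sort the report by cumulative time.
cProfile.run("unitTest()", sort="cumulative")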

Overall, this is probably not the best implementation possible, but it does provide a starting point for further optimization. As is often said, optimization via parallelization requires proper analysis and thought.

Timing examples, all with 8 CPUs.

200x200 (20,100 calculations/assignments):
t=1.0 : execution time 18 s
t=0.01: execution time 3 s

500x500 (125,250 calculations/assignments):
t=1.0 : execution time 86 s
t=0.01: execution time 23 s

In case anyone wants to copy and paste the code, here is a unit test I used for part of the development. Obviously, the labelled matrix class code is not included here, and neither is the fingerprint reader/scorer code (though it is pretty simple to roll your own). Of course, I'm happy to share that code as well if it would help someone.

def unitTest():
 import cStringIO, os
 from fingerprintReader import MismatchKernelReader
 from fingerprintScorers import FeatureVectorLinearKernel
 exampleData = cStringIO.StringIO() # 9 examples from GPCR (3,1)-mismatch descriptors, first 10 columns.
 exampleData.write( ",AAA,AAC,AAD,AAE,AAF,AAG,AAH,AAI,AAK"  + os.linesep )
 exampleData.write( "TS1R2_HUMAN,5,2,3,6,8,6,6,7,4" + os.linesep )
 exampleData.write( "SSR1_HUMAN,11,6,5,7,4,7,4,7,9" + os.linesep )
 exampleData.write( "OXYR_HUMAN,27,13,14,14,15,14,11,16,14" + os.linesep )
 exampleData.write( "ADA1A_HUMAN,7,3,5,4,5,7,3,8,4" + os.linesep )
 exampleData.write( "TA2R_HUMAN,16,6,7,8,9,10,6,6,6" + os.linesep )
 exampleData.write( "OXER1_HUMAN,10,6,5,7,11,9,5,10,6" + os.linesep )
 exampleData.write( "NPY1R_HUMAN,3,3,0,2,3,1,0,6,2" + os.linesep )
 exampleData.write( "NPSR1_HUMAN,0,1,1,0,3,0,0,6,2" + os.linesep )
 exampleData.write( "HRH3_HUMAN,16,9,9,13,14,14,9,11,9" + os.linesep )
 exampleData.write( "HCAR2_HUMAN,3,1,3,2,5,1,1,6,2" )
 exampleData.seek( 0 ) # Rewind the in-memory file before parsing (assumes the parser reads from the current position).
 columnIDs = ( "TS1R2_HUMAN", "SSR1_HUMAN", "OXYR_HUMAN", "ADA1A_HUMAN", "TA2R_HUMAN", "OXER1_HUMAN",
               "NPY1R_HUMAN", "NPSR1_HUMAN", "HRH3_HUMAN", "HCAR2_HUMAN", )
 m = createSimilarityMatrix( exampleData, MismatchKernelReader, FeatureVectorLinearKernel, columnIDs,
                             verbose=True, )
 m.SetOutputPrecision( 6 )
 print m

## end of unitTest()
J.B. Brown
  • You're probably aware of this, but the way you are formatting your code is a little unusual, and in my opinion makes it quite hard to read. In particular, most people use four spaces for indentation - I don't think one space is really enough for it to be clear at a glance which indentation level you are at. The advice in [PEP 8](http://www.python.org/dev/peps/pep-0008/) is really very helpful, especially when people start sharing code with each other. – James Mar 21 '12 at 16:51
  • Indeed, my code doesn't follow PEP8 word-for-word. As I prefer to have as few literals as possible in my code, and use variable names that describe the intent of the code, limiting my lines to 79 characters and using four spaces per indentation would cause the number of lines to grow considerably. I definitely agree with you that the logic in lines 91-106 would benefit from code that uses more indentation, but chose to present the code this way with comments to begin each logical block in order to limit its length. Thank you for various types of advice. – J.B. Brown Mar 22 '12 at 02:01
  • Since posting this code, I've discovered that it does not scale up to examples of 10,000 or more entries, namely because of memory consumption issues arising from behind-the-scenes work necessary in Python multiprocessing for updating data structures. It seems that multiprocessing + SQLite/MySQL will be necessary in order to scale the problem to larger datasets. – J.B. Brown Mar 27 '12 at 02:29
0

In reference to my last comment attached to the code posted on March 21, I found multiprocessing.Pool + SQLite (pysqlite2) unusable for my particular task, as the following problems occurred:

(1) Using the default connection, every worker process other than the first executed its insert query only once.
(2) When I changed the connection keyword to check_same_thread=False, the full pool of workers was used, but then only some queries succeeded and some failed. When each worker also executed time.sleep(0.01), the number of query failures was reduced, but not eliminated.
(3) Less importantly, I could hear my hard disk reading/writing frantically, even for a small job list of 10 insert queries.

I next resorted to MySQL-Python (MySQLdb), and things worked out much better. True, one must set up the MySQL server daemon, a user, and a database for that user, but those steps are relatively simple.

Here is sample code that worked for me. Obviously it could be optimized, but it conveys the basic idea for those who are looking for how to get started with multiprocessing.

from multiprocessing import Pool, current_process
import MySQLdb
from numpy import random


if __name__ == "__main__":

  numValues   = 50000
  tableName   = "tempTable"
  useHostName = ""
  useUserName = ""  # Insert your values here.
  usePassword = ""
  useDBName   = ""

  # Setup database and table for results.
  dbConnection = MySQLdb.connect( host=useHostName, user=useUserName, passwd=usePassword, db=useDBName )
  topCursor = dbConnection.cursor()
  # Assuming table does not exist, will be eliminated at the end of the script.
  topCursor.execute( 'CREATE TABLE %s (oneText TEXT, oneValue REAL)' % tableName )
  topCursor.close()
  dbConnection.close()

  # Define simple function for storing results.
  def work( storeValue ):
    #print "%s storing value %f" % ( current_process().name, storeValue )
    try:
      dbConnection = MySQLdb.connect( host=useHostName, user=useUserName, passwd=usePassword, db=useDBName )
      cursor = dbConnection.cursor()
      cursor.execute( "SET AUTOCOMMIT=1" )
      try:
        query = "INSERT INTO %s VALUES ('%s',%f)" % ( tableName, current_process().name, storeValue )
        #print query
        cursor.execute( query )
      except:
        print "Query failed."

      cursor.close()
      dbConnection.close()
    except:
      print "Connection/cursor problem."


  # Create set of values to assign
  values = random.random( numValues )

  # Create pool of workers
  pool = Pool( processes=6 )
  # Execute assignments.
  for value in values: pool.apply_async( func=work, args=(value,) )
  pool.close()
  pool.join()

  # Cleanup temporary table.
  dbConnection = MySQLdb.connect( host=useHostName, user=useUserName, passwd=usePassword, db=useDBName )
  topCursor = dbConnection.cursor()
  topCursor.execute( 'DROP TABLE %s' % tableName )
  topCursor.close()
  dbConnection.close()
J.B. Brown