The code below doesn't seem to run concurrently, and I'm not sure exactly why:
import multiprocessing

def run_normalizers(config, debug, num_threads, name=None):
    def _run():
        print('Started process for normalizer')
        sqla_engine = init_sqla_from_config(config)
        image_vfs = create_s3vfs_from_config(config, config.AWS_S3_IMAGE_BUCKET)
        storage_vfs = create_s3vfs_from_config(config, config.AWS_S3_STORAGE_BUCKET)
        pp = PipedPiper(config, image_vfs, storage_vfs, debug=debug)
        if name:
            pp.run_pipeline_normalizers(name)
        else:
            pp.run_all_normalizers()
        print('Normalizer process complete')

    threads = []
    for i in range(num_threads):
        threads.append(multiprocessing.Process(target=_run))
    [t.start() for t in threads]
    [t.join() for t in threads]
run_normalizers(...)
The config variable is just a dictionary defined outside of the _run() function. All of the processes appear to be created, but the whole thing is no faster than running everything in a single process. What the run_*_normalizers() functions basically do is read from a queue table in a database (via SQLAlchemy), make a few HTTP requests, run a 'pipeline' of normalizers to modify the data, and then save it back into the database. I'm coming from JVM land, where threads are 'heavy' and often used for parallelism - so I'm a bit confused here, since I thought the multiprocessing module was supposed to get around the limitations of Python's GIL.
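For what it's worth, my understanding is that each multiprocessing.Process is a separate OS process with its own GIL, so a minimal CPU-bound sketch like the one below (just a throwaway test I'd expect to work, not my real code) should keep multiple cores busy and print a different PID per worker:

import multiprocessing
import os

def _burn_cpu():
    # purely CPU-bound work; each worker prints its own process id
    total = sum(i * i for i in range(10_000_000))
    print(f'pid={os.getpid()} finished, total={total}')

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=_burn_cpu) for _ in range(4)]
    [p.start() for p in procs]
    [p.join() for p in procs]

If that's right, then starting the processes themselves shouldn't be the problem, and I don't understand where the serialization in my code is coming from.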