I am trying to use multiprocessing for a task that is very slow when run as a single process. As you can see in the code below, each process is supposed to return some results (return_dict). I initially tested this code with a 10K-row dataset (the data is stored in the docs.txt file, about 70 MB), and the code ran as expected. However, when I ran the script on the full dataset (approx. 5.6 GB), I got an AssertionError, as shown at the bottom of my question. Does anyone know what may have caused it and how I can avoid it? Thanks.

from multiprocessing import Process, Manager
import os, io, numpy
from gensim.models.doc2vec import Doc2Vec

def worker(i, data, return_dict):
    model = Doc2Vec.load("D:\\Project1\\doc2vec_model_DM_20180814.model")
    results = numpy.zeros((len(data), model.vector_size))
    for id, doc in enumerate(data):
        results[id,:] = model.infer_vector(doc, alpha = 0.01, steps = 100)
    return_dict[i] = results

if __name__ == '__main__':
    import time
    a = time.time()
    path = "D:\\Project1\\docs.txt" # <<=== data stored in this file
    data = []
    manager = Manager()
    jobs = []
    return_dict = manager.dict()

    with io.open(path, "r+", encoding = "utf-8") as datafile:
        for id, row in enumerate(datafile):
            row = row.strip().split('\t')[0].split()
            data.append(row)

    step = numpy.floor(len(data)/20)
    intervals = numpy.arange(0, len(data), step = int(step)).tolist()
    intervals.append(len(data))

    for i in range(len(intervals) - 1):
        p = Process(target=worker, args=(i, data[intervals[i]:intervals[i+1]], return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()

    results = numpy.zeros((len(data), 1000))
    start = 0
    end = 0
    for _, result in return_dict.items(): #<<===Where error happens
        end = end + result.shape[0]
        results[start:end,:] = result[:,:]
        start = end

    print(time.time() - a)

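The way the script splits the data into slices can be checked in isolation. The following is a minimal sketch that reproduces only the interval arithmetic from the code above; the helper name chunk_bounds and the sample size 105 are hypothetical and not part of the original script. It shows that when the number of rows is not a multiple of 20, the script starts more than 20 processes.

```python
import numpy

def chunk_bounds(n_items, n_chunks=20):
    """Hypothetical helper: boundary indices computed as in the script above."""
    step = int(numpy.floor(n_items / n_chunks))
    bounds = numpy.arange(0, n_items, step=step).tolist()
    bounds.append(n_items)
    return bounds

# Sample size chosen for illustration only.
bounds = chunk_bounds(105)

# Consecutive pairs of boundaries are the slices handed to each worker.
slices = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

print(len(slices))   # 21 slices, so 21 processes rather than 20
print(slices[-1])    # (100, 105)
```
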
Error message:
Traceback (most recent call last):
  File "D:\Project1\multiprocessing_test.py", line 43, in <module>
    for _, result in return_dict.items():
  File "<string>", line 2, in items
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 318, in _recv_bytes
    return self._get_more_data(ov, maxsize)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 337, in _get_more_data
    assert left > 0
AssertionError
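
One possible way to sidestep the failing call, assuming (this is not confirmed) that the error comes from return_dict.items() pulling every result array through the manager connection in a single message: let each worker write its array to its own .npy file and have the parent load the files by index. The sketch below is not the original script. fake_infer and VECTOR_SIZE are hypothetical stand-ins for model.infer_vector and model.vector_size, the tiny data list is made up, and the workers are called sequentially here only to keep the example self-contained.

```python
import os
import tempfile
import numpy

VECTOR_SIZE = 4  # assumption: stand-in for model.vector_size

def fake_infer(doc):
    # Hypothetical placeholder for model.infer_vector(doc, ...)
    return numpy.full(VECTOR_SIZE, float(len(doc)))

def worker(i, data, out_dir):
    # Same shape as the original worker, but the result goes to disk
    # instead of into a Manager dict.
    results = numpy.zeros((len(data), VECTOR_SIZE))
    for row, doc in enumerate(data):
        results[row, :] = fake_infer(doc)
    numpy.save(os.path.join(out_dir, "part_{}.npy".format(i)), results)

def collect(out_dir, n_parts):
    # Load the parts by index so rows stay in the original order.
    parts = [numpy.load(os.path.join(out_dir, "part_{}.npy".format(i)))
             for i in range(n_parts)]
    return numpy.vstack(parts)

# Made-up data, for illustration only.
data = [["a"], ["b", "c"], ["d", "e", "f"], ["g"]]
chunks = [data[0:2], data[2:4]]
out_dir = tempfile.mkdtemp()

for i, chunk in enumerate(chunks):
    # In the real script this call would be
    # Process(target=worker, args=(i, chunk, out_dir)) followed by start()/join().
    worker(i, chunk, out_dir)

results = collect(out_dir, len(chunks))
print(results.shape)  # (4, 4)
```

Loading the parts by index also keeps the rows in input order, whereas iterating over return_dict.items() may yield the chunks in the order the workers happened to finish.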