First-time poster, please be patient with me.
I'm using Python 2.7.12 and I want to use the multiprocessing module to speed up my code.
My program takes files in all sorts of formats (pdf, xlsx, docx, etc.), extracts the data inside them, and saves it to a .txt or .csv file.
The problem is that I have literally thousands of files to process, and on one core it takes more than an hour for the test folder (filled with a mix of files) that I'm using.
I have read many different examples of the use of Pool with apply_async, map_async, map... But whenever I try to adapt them to my code, they run but don't do anything and throw no error.
The only multiprocessing approach I got to work is multiprocessing.Process, but I have no way to tell it to run on only 8 cores, so it starts as many processes as there are files I pass to it. Not very elegant, but it does the job.
I'm currently learning Python and would like to learn the correct way to use the library, so please help me learn to use Pool with apply_async and map_async.
Below I'll attach the code running on a single core, the multicore version using Process, and the non-working apply_async version, so you can advise me better.
Using one core:
def start(self, fileorfolder):
    """
    Starts the function and converts files and folders
    :param fileorfolder: str. absolute path to a file or folder
    :return:
    """
    print self.folder_dictio
    if os.path.isdir(fileorfolder):
        archives = (file for file in os.listdir(fileorfolder)
                    if os.path.isfile(os.path.join(fileorfolder, file)))
        for data in archives:
            out_q = multiprocessing.Queue()
            p = multiprocessing.Process(target=self.selector,
                                        args=(os.path.join(fileorfolder, data), out_q))
            p.start()
            # Wait for 120 seconds or until the process finishes
            p.join(120)
            version, extension, text = out_q.get()
            path, name = os.path.split(data)
            if (extension == "xlsx") or (extension == "ods"):
                with open(os.path.join(self.folder_dictio['csv'], name + ".csv"), 'wb') \
                        as archive:
                    archive.write(text)
            else:
                with open(os.path.join(self.folder_dictio['txt'], name + ".txt"), 'wb') \
                        as archive:
                    archive.write(text)
Multicore using Process:
def start(self, fileorfolder):
    """
    Starts the function and converts files and folders
    :param fileorfolder: str. absolute path to a file or folder
    :return:
    """
    start = time.time()
    print self.folder_dictio
    if os.path.isdir(fileorfolder):
        archives = (file for file in os.listdir(fileorfolder)
                    if os.path.isfile(os.path.join(fileorfolder, file)))
        lista = []
        for files in archives:
            lista.append(os.path.join(fileorfolder, files))
        jobs = []
        for i in lista:
            p = multiprocessing.Process(target=self.selector, args=(i,))
            jobs.append(p)
            p.start()
        for joc in jobs:
            joc.join(120)
            joc.terminate()
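The only way I found to cap the number of simultaneous processes with this approach is to run the list in batches of 8. A rough standalone sketch of the idea (with a dummy worker function standing in for my real selector method, so it isn't my actual code):

```python
import multiprocessing

def worker(path, out_q):
    # Dummy stand-in for the real extraction work done by selector
    out_q.put(path.upper())

def run_in_batches(paths, batch_size=8):
    """Start at most batch_size processes at a time."""
    results = []
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        out_q = multiprocessing.Queue()
        jobs = [multiprocessing.Process(target=worker, args=(p, out_q))
                for p in batch]
        for job in jobs:
            job.start()
        # One result per process; get() blocks until each one arrives.
        # Draining the queue before join() avoids blocking on a process
        # whose queue buffer hasn't been flushed yet.
        for _ in batch:
            results.append(out_q.get())
        for job in jobs:
            job.join(120)
    return results
```

It works, but it's still manual bookkeeping that I'd expect Pool to do for me.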
Non-Working apply_async:
def start(self, fileorfolder):
    """
    Starts the function and converts files and folders
    :param fileorfolder: str. absolute path to a file or folder
    :return:
    """
    start = time.time()
    print self.folder_dictio
    if os.path.isdir(fileorfolder):
        archives = (file for file in os.listdir(fileorfolder)
                    if os.path.isfile(os.path.join(fileorfolder, file)))
        lista = []
        for files in archives:
            lista.append(os.path.join(fileorfolder, files))
        pool = multiprocessing.Pool()
        for items in lista:
            pool.apply_async(self.selector, items)
        pool.close()
        pool.join()
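For reference, the pattern I keep seeing in the examples I've read looks like the sketch below: args passed as a tuple, a plain top-level function instead of a bound method (I've read that Python 2 can't pickle instance methods, which might be why my version fails silently), and result.get() called so worker exceptions actually surface. The convert function here is a dummy placeholder, not my real code:

```python
import multiprocessing

def convert(path):
    # Dummy placeholder for the real per-file extraction
    return path, len(path)

def run_pool(paths, workers=8):
    pool = multiprocessing.Pool(processes=workers)
    # Note: apply_async takes args as a tuple, hence (p,) and not p
    async_results = [pool.apply_async(convert, (p,)) for p in paths]
    pool.close()
    pool.join()
    # get() re-raises any exception raised inside a worker,
    # instead of letting it disappear silently
    return [result.get() for result in async_results]
```

Is this roughly the shape my code should take, and if so, how do I adapt it to a method like self.selector?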