In my Python code there is a for loop that reads files from a large list of filenames, extracts information from each file, and then writes that information to a numpy.ndarray. I realized that this for loop takes a lot of time to complete, and that I could save time by parallelizing the process with multiprocessing.Pool().
The for loop I want to parallelize looks like this (the actual code is different):
Matrix = [[0,0,0],[0,0,0],......[0,0,0]]
# a 2D numpy array of zeros; this is where we want to write information to
FileList = [file1, file2, file3, ...., fileN]
# a list containing the file names
for index in range(0, len(FileList)):
    Data = ReadDataFromFile(FileList[index])
    # read some information from the file into the variable Data
    Matrix[M][N] = Data
    # the value of Data is written to the (M, N)th element of the matrix
I want to parallelize this: instead of the system reading one file at a time, I want it to read as many files as possible in parallel.
I could not find a way to parallelize the for loop directly, so following some examples I saw on Stack Overflow, I wrapped the body of the loop in a function and then used multiprocessing.Pool.map().
The function takes a filename as input, reads the information from the file as described above, and writes it to the numpy.ndarray, which is already defined. Inside the function, I declare the array as global so that modifications made inside the function are visible outside it.
def GetDataFromFile(filename):
    global Matrix
    # declare Matrix as a global variable
    Data = ReadData(filename)
    Matrix[M][N] = Data
    # write the information to the global Matrix
When called directly, the function works fine and writes the information from the file into the array.
But when I tried to parallelize the process with multiprocessing.Pool.map(), it did not work as expected: the modifications made by GetDataFromFile did not change the values of the ndarray globally.
import multiprocessing

Matrix = [[0,0,0],[0,0,0],......[0,0,0]]
# a 2D numpy array of zeros; this is where we want to write information to
FileList = [file1, file2, file3, ...., fileN]
# a list containing the file names

def GetDataFromFile(filename):
    global Matrix
    # declare Matrix as a global variable
    Data = ReadData(filename)
    Matrix[M][N] = Data
    # write the information to the global Matrix

p = multiprocessing.Pool()
p.map(GetDataFromFile, FileList)
print Matrix
The output of the above code is all zeros; the function does not add information to the ndarray when used with multiprocessing.Pool.map().
What is the problem here? How can I fix it? Is there an alternative way to achieve the same thing?
Thanks in advance. I am using Python 2.7 on Ubuntu 16.04 LTS.