A data file that I need to read is too big, and loading it into a list takes too long. How can I use multiprocessing for this? In other words, I would like to parallelise the file read and the loading into a list. Could you please help?
Basically, I have a data table that I need to load into a list, something like the code below. Reading the file does not take much time, but loading it into a list (myList) takes about 1 minute. So, is it possible to parallelise this:
def load_file(self, fileName):
    time_start = time.time()
    myList = []
    # mySet = set()
    lines = self.read_file(fileName)
    # time_end = time.time()
    # print fileName, ": loaded ", round(time_end-time_start, 4), " seconds"
    for line in lines:
        content = line.split()
        myList.append(content)
    time_end = time.time()
    print fileName, ": ", len(myList), " rows loaded in", round(time_end-time_start, 4), " seconds"
    return myList
def read_file(self, fileName):
    filePath = self.data_directory + '\\' + fileName
    try:
        with open(filePath, 'r') as f:
            lines = f.readlines()
        return lines
    except IOError:
        print filePath + ' does not exist'
An ad hoc way could be to split the work: assume the file has 2M lines, so len(lines) = 2M; load the first 1M lines into myList1 and the second 1M into myList2 in parallel, and then merge them with myList = myList1 + myList2. But this doesn't sound like best practice, as in the sketch below.
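For example, something like the following is what I have in mind, using multiprocessing.Pool (a rough, untested sketch; parse_chunk, load_file_parallel and num_workers are just placeholder names I made up). My worry is that cutting the lines into chunks and merging the per-chunk results adds its own overhead:

import multiprocessing

# worker functions need to be defined at module level so they can be pickled
def parse_chunk(lines):
    # split each line into its columns, exactly as the serial loop does
    return [line.split() for line in lines]

def load_file_parallel(lines, num_workers=4):
    # cut the list of lines into num_workers roughly equal chunks
    chunk_size = (len(lines) + num_workers - 1) // num_workers
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    pool = multiprocessing.Pool(num_workers)
    try:
        # each worker parses one chunk; results come back in chunk order
        results = pool.map(parse_chunk, chunks)
    finally:
        pool.close()
        pool.join()

    # merge the per-chunk lists back into one list (the myList1 + myList2 idea)
    myList = []
    for part in results:
        myList.extend(part)
    return myList

Is something like Pool.map the right direction here, or is there a better way to speed this up?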