Im new to python and having trouble passing in an object in a function. Basically, I'm trying to read a large file over 1.4B lines.
I am passing in an object that contains information on the file. One of these is a very large array containing the location of the start of each line in the file.
This is a large array and by passing just the object reference I wish to have just one instance of the array which is then shared by multiple processes although I don't know if this is actually happening.
The array when passed into the process_line function is then empty leading to errors. This is the problem.
Here is where the function is being called (see the p.starmap)
with open(file_name, 'r') as f:
line_start = file_inf.start_offset
# Iterate over all lines and construct arguments for `process_line`
while line_start < file_inf.file_size:
line_end = min(file_inf.file_size, line_start + line_size) #end = minimum of either file size and line start + line size
# Save `process_line` arguments
args = [file_name, line_start, line_end, file_inf.line_offset ] #arguments for process_line
line_args.append(args)
# Move to the next line
line_start = line_end
print(line_args[1])
with multiprocessing.Pool(cpu_count) as p: #run the process_line function on each line
# Run lines in parallel
# starmap() is like map() except the the we have multiple arguments in a list so we use starmap
line_result = p.starmap(process_line, line_args) #maps the process_line function to each line
This is the function:
def process_line(file_name,line_start, line_end, file_obj):
line_results = register()
c2 = register()
c1 = register()
with open(file_name, 'r') as f:
# Moving stream position to `line_start`
f.seek(file_obj[line_start])
i = 0
if line_start == 63400:
print ("hello")
# Read and process lines until `line_end`
for line in f:
line_start += 1
if line_start > line_end:
line_results.__append__(c2)
c2.clear()
break
c1 = func(line)
c2.__add__(c1)
i= i+1
return line_results.countf
where file_obj contains line_offset which is the array in question.
Now If I remove the multiprocessing and just use:
line_result = starmap(process_line, line_args)
the array is passed in just fine. Although without multiprocessing
Also if I pass in just the array instead of the entire object then it also works but now for some reason only 2 processes work (on Linux, on Windows using task manager only 1 works while the rest just use memory but not CPU). Instead of an expected 20 which is critical for this task.
Is there any solution to this? please help