
I'm new to Python and I'm having trouble passing an object into a function. Basically, I'm trying to read a large file of over 1.4B lines.

I am passing in an object that contains information about the file. One of its attributes is a very large array containing the offset of the start of each line in the file.

Since this is a large array, by passing just the object reference I hope to have only one instance of the array, shared by multiple processes, although I don't know whether that is actually happening.
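For reference, here is the kind of sharing I was hoping for. This is only a minimal sketch (the `offsets` list, `worker` function, and pool size are made up for illustration): the array is sent to each worker once via a `Pool` initializer instead of once per task, and on Linux, where the default start method is fork, a module-level array created before the pool is inherited copy-on-write.

import multiprocessing

_offsets = None  # set once per worker process by the initializer

def _init_worker(offsets):
    # Runs once in each worker; stash the array where tasks can reach it
    global _offsets
    _offsets = offsets

def worker(i):
    # Hypothetical task: look up one entry of the shared array
    return _offsets[i]

if __name__ == '__main__':
    offsets = list(range(1000))  # stand-in for the real line-offset array
    with multiprocessing.Pool(4, initializer=_init_worker,
                              initargs=(offsets,)) as p:
        print(p.map(worker, range(10)))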

The problem is that when the array is passed into the `process_line` function, it arrives empty, which leads to errors.

Here is where the function is called (see the `p.starmap` call):

import multiprocessing

line_args = []
with open(file_name, 'r') as f:
    line_start = file_inf.start_offset
    # Iterate over all lines and construct arguments for `process_line`
    while line_start < file_inf.file_size:
        # End is the minimum of the file size and line_start + line_size
        line_end = min(file_inf.file_size, line_start + line_size)

        # Save `process_line` arguments
        args = [file_name, line_start, line_end, file_inf.line_offset]
        line_args.append(args)

        # Move to the next line
        line_start = line_end

print(line_args[1])
with multiprocessing.Pool(cpu_count) as p:
    # Run the lines in parallel; starmap() is like map() except that each
    # element of line_args is unpacked into multiple arguments
    line_result = p.starmap(process_line, line_args)

This is the function:

def process_line(file_name, line_start, line_end, file_obj):
    line_results = register()
    c2 = register()
    c1 = register()
    with open(file_name, 'r') as f:
        # Move the stream position to the byte offset of `line_start`
        f.seek(file_obj[line_start])
        i = 0
        if line_start == 63400:  # leftover debug check
            print("hello")
        # Read and process lines until `line_end`
        for line in f:
            line_start += 1
            if line_start > line_end:
                line_results.__append__(c2)
                c2.clear()
                break
            c1 = func(line)
            c2.__add__(c1)
            i = i + 1
    return line_results.countf

Here the `file_obj` parameter receives `file_inf.line_offset`, the array in question.

Now, if I remove the multiprocessing and just use `line_result = starmap(process_line, line_args)` (the plain, serial `starmap` from `itertools`), the array is passed in just fine, although of course without multiprocessing.
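For clarity, the working non-parallel version is just the standard-library `itertools.starmap`:

from itertools import starmap

# Serial equivalent of p.starmap: lazily unpacks each argument list
# into process_line, so list() is needed to actually run it
line_result = list(starmap(process_line, line_args))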

Also, if I pass in just the array instead of the entire object, it works too, but then for some reason only 2 processes actually do any work (on Linux; on Windows, Task Manager shows only 1 working while the rest just use memory but no CPU), instead of the expected 20, which is critical for this task.

[Screenshot: Processes]

Is there any solution to this? Please help.

  • First of all, there is no array being passed in the multiprocessing case. Instead, "jobs" are being submitted to the pool by being written to a queue. Second, what do you mean by "Also if I pass in just the array instead of the entire object then it also works"? Third, having multiple processes go against the same dataset in parallel may kill performance unless you have a solid-state drive and even then the degree of multiprocessing by which you will benefit may be quite low before you start doing worse. – Booboo Apr 06 '22 at 12:28
  • Consider using method `imap` with a suitable *chunksize* argument with a generator function that yields one-by-one all the arguments for `process_line`, which now will take a single argument (a tuple or a list) that can be unpacked. – Booboo Apr 06 '22 at 12:30
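A minimal sketch of what the second comment suggests, reusing the names from the question (`cpu_count`, `line_size`, `file_inf`); the single-argument wrapper and the chunksize of 16 are illustrative guesses, not tested values:

import multiprocessing

def process_line_single(args):
    # Unpack the single tuple argument into the original parameters
    return process_line(*args)

def gen_args(file_name, file_inf, line_size):
    # Yield argument tuples one by one instead of building a huge list
    line_start = file_inf.start_offset
    while line_start < file_inf.file_size:
        line_end = min(file_inf.file_size, line_start + line_size)
        yield (file_name, line_start, line_end, file_inf.line_offset)
        line_start = line_end

with multiprocessing.Pool(cpu_count) as p:
    # imap consumes the generator lazily; chunksize batches tasks per worker
    line_result = list(p.imap(process_line_single,
                              gen_args(file_name, file_inf, line_size),
                              chunksize=16))

Combining this with an initializer pattern, so the tuples do not carry the large array at all, would also avoid pickling the array for every task.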
