
I am trying to use multiprocessing to append to a CSV file. I have multiple CSV files that I am looping over. The function works with a normal for loop but does not work with multiprocessing. I hope someone can shed some light on this.

My function code is as follows:

def read_write2(j, lock):
    #i = 2
    with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
        #i = i + 1
        with open('samples2.csv','a') as file: #output file
            for line in a_file:
                lock.acquire()
                stripped_line = line.strip()
                a = len(stripped_line)
                if "©" in stripped_line or "flow" in stripped_line or a>254:
                    pass
                else:
                    file.write(stripped_line)
                    file.write("\n")
                lock.release()

My multiprocessing code here is as follows:

if __name__ == "__main__":
    lock = Lock()
    processes = []

    for i in range(2,fileno+1):
        print(i)
        process = Process(target=read_write2, args=(i,lock)) #creating a new process
        processes.append(process) #appending process to a processes list

    for process in processes:
        print(process)
        process.start()

    for process in processes: #loop over list to join process
        process.join() #process will finish before moving on with the script

Output is as follows:

7
2
3
4
5
6
7
<Process name='Process-1' parent=24328 initial>
<Process name='Process-2' parent=24328 initial>
<Process name='Process-3' parent=24328 initial>
<Process name='Process-4' parent=24328 initial>
<Process name='Process-5' parent=24328 initial>
<Process name='Process-6' parent=24328 initial>
7
7
7
7
7
7

Thank you.

meister
  • It may not be a good idea to use the same file in many processes. The system may run the processes in any order, so you get results in random order, and when two processes try to write to the same file at the same time they can corrupt its data. Use the processes to work on the data, but send everything to one process and let that process write all of the data. – furas May 30 '22 at 02:25
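
A minimal sketch of the single-writer pattern furas describes, assuming a multiprocessing.Pool where each worker only returns its filtered lines and the parent alone writes the output file (the path, the filter condition, and the range are taken from the question; the rest, including the fileno placeholder, is illustrative):

from multiprocessing import Pool

def filter_file(j):
    # Worker: read one input file and return the lines worth keeping.
    path = 'C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j)
    kept = []
    with open(path, 'r') as a_file:
        for line in a_file:
            stripped_line = line.strip()
            if "©" in stripped_line or "flow" in stripped_line or len(stripped_line) > 254:
                continue
            kept.append(stripped_line)
    return kept

if __name__ == "__main__":
    fileno = 7  # placeholder for the real file count
    with Pool() as pool, open('samples2.csv', 'w') as out_f:
        # Only the parent process touches the output file, so no lock is needed.
        for lines in pool.imap(filter_file, range(2, fileno + 1)):
            for stripped_line in lines:
                out_f.write(stripped_line + "\n")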

3 Answers


Yeah. Not going to work. Each of your threads has a different "handle" into the file, since you've opened it multiple times. You're going to need to open it once, and pass that to the threads.
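
If the workers really were threads (which share memory inside one process), the file could literally be opened once and the handle shared. A minimal sketch of that idea, assuming threading.Thread and a lock around each write (this code is not from the answer):

import threading

def read_write2(j, out_file, lock):
    path = 'C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j)
    with open(path, 'r') as a_file:
        for line in a_file:
            stripped_line = line.strip()
            if "©" in stripped_line or "flow" in stripped_line or len(stripped_line) > 254:
                continue
            with lock:  # one thread at a time writes through the single shared handle
                out_file.write(stripped_line + "\n")

if __name__ == "__main__":
    lock = threading.Lock()
    fileno = 7  # placeholder for the real file count
    with open('samples2.csv', 'a') as out_file:  # opened exactly once
        threads = [threading.Thread(target=read_write2, args=(i, out_file, lock))
                   for i in range(2, fileno + 1)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()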

Frank Yellin
  • Hi! Thanks for your reply. Can you please tell me where the error is? – meister May 30 '22 at 02:20
  • You open the output file multiple times. Each of those "opens" writes to the same file independently, starting from the beginning. Add to that the fact that Python buffers writes, and your end result is just going to be a mess. – Frank Yellin May 30 '22 at 03:11

As already mentioned, you open and write to the same file from multiple processes, and without locking or other synchronization that causes trouble: each process keeps its own position in the file, so it is not aware that another process has written to the file and may start writing from the same position as the other process(es). There are better ways to do this, but to keep the changes to your code minimal, I suggest holding the lock while you open, write to, and close the output file, so the order looks like this:

with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
    for line in a_file:
        if ...:
            ...
        else:
            lock.acquire()
            with open('samples2.csv','a') as file: #output file
                ...
            lock.release()

Although this would cause heavy overhead on disk I/O, this should be the minimal change to your code to make it work using multiprocessing. The whole function would then be:

def read_write2(j, lock):
    with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
        for line in a_file:
            stripped_line = line.strip()
            a = len(stripped_line)
            if "©" in stripped_line or "flow" in stripped_line or a>254:
                pass
            else:
                lock.acquire()
                with open('samples2.csv','a') as file: #output file
                    file.write(stripped_line)
                    file.write("\n")
                lock.release()

P.S. Depending on the number of files, their sizes, the number of output lines, and a lot of other factors, it may be more efficient for each process to write to its own file and to collate the output into one file afterwards in the main process. This saves a lot of file open/close operations and eliminates the need for locks. For example, rewriting the function as follows:

def read_write2(j):
    with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
        with open('samples2_{}.csv'.format(j),'a') as file: #output file
            for line in a_file:
                stripped_line = line.strip()
                a = len(stripped_line)
                if "©" in stripped_line or "flow" in stripped_line or a>254:
                    pass
                else:
                    file.write(stripped_line)
                    file.write("\n")

Then, in the main code (under if __name__ == "__main__":), replace the following code:

for process in processes: #loop over list to join process
    process.join() #process will finish before moving on with the script

with this:

with open('samples2.csv', 'w') as out_f:
    for e, process in enumerate(processes):
        process.join()
        with open('samples2_{}.csv'.format(e+2), 'r') as in_f:
            out_f.write(in_f.read())    # NOTE: this is highly inefficient, and may consume too much memory. But that's not relevant to the question at hand.
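
If memory usage is a concern, that read/write could be replaced by a chunked copy, for example with shutil.copyfileobj (an alternative sketch, not part of the original answer):

import shutil

with open('samples2.csv', 'w') as out_f:
    for e, process in enumerate(processes):
        process.join()
        with open('samples2_{}.csv'.format(e + 2), 'r') as in_f:
            shutil.copyfileobj(in_f, out_f)  # copies in fixed-size chunks instead of reading the whole file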
micromoses

Thank you all for your inputs. They were instrumental in finding my answer.

The answer was actually to just put everything into main. It seems to be working fine and it has solved the error. I'm checking thousands and thousands of URLs.

I put all of my function definitions under if __name__ == "__main__": and was able to solve it.

Thank you all again. :)

meister