bash append file from multiple thread

Question

I'm working with big data and trying to parallelize my processing functions. I can use several threads and process every user in a different thread (I have 200k users).

Every thread should append the first n lines of a file it produces to an output file shared between all the threads.

I wrote a Java program that executes `head -n 256 thread_processed.txt >> output` (every thread does this).

I need the output file to be written atomically.

If thread A writes lines 0 to 9 and thread B writes lines 10 to 19, the output should be [0 ... 9 10 ... 19]. Lines can't interleave; it can't be something like [0 1 2 17 18 3 4 ...].

How can I manage concurrent write access to the output file from a bash script?

Your Java code needs to write the output of each thread to a separate file, so that another thread can concatenate them in the correct order. You don't need all the threads to complete to concatenate the output from the first `k` threads, but you do need the first `k` to complete. — chepner, Feb 06 '17 at 19:52
Do a mega hack and use `sed` to write to a specific line. But seriously, if you know the order, do as chepner suggested, or prefix the lines with a number and sort them. — bliof, Feb 06 '17 at 23:10
P.S. Or make the lines the same size and you'll be able to put them in the correct positions easily from Java. — bliof, Feb 06 '17 at 23:18

score 6 · Accepted Answer · answered Feb 07 '17 at 01:41

sem from GNU Parallel should be able to do it:

sem --id mylock "head -n 256 thread_processed.txt >> output"

This creates a mutex named mylock, so only one of these commands runs at a time; the others wait for the lock.

If you are concerned that someone might read output while head is still writing to it:

sem --id mylock "cp output o2; head -n 256 thread_processed.txt >> o2; mv o2 output"
