
I'm working with big data and trying to parallelize my processing functions. I can use several threads and process every user in a different thread (I have 200k users).

Every thread should append the first n lines of the file it produces to an output file shared by all the threads.

I wrote a Java program that executes `head -n 256 thread_processed.txt >> output` (every thread does this).

I need each write to the output file to be atomic.

If thread A writes lines 0 to 9 and thread B writes lines 10 to 19, the output should be [0 ... 9 10 ... 19]. Lines must not interleave; it can't be something like [0 1 2 17 18 3 4 ...].

How can I manage concurrent write access to the output file in a bash script?

soundslikeodd
Progeny
  • Your Java code needs to write the output of each thread to a separate file, so that another thread can concatenate them in the correct order. You don't need all the threads to complete to concatenate the output from the first `k` threads, but you do need the first `k` to complete. – chepner Feb 06 '17 at 19:52
  • Do a mega hack and use `sed` to write to a specific line. But seriously, if you know how to order them, do as chepner suggested, or prefix the lines with a number and sort them. – bliof Feb 06 '17 at 23:10
  • P.S. Or make the lines the same size and you'll be able to put them in the correct positions easily from Java. – bliof Feb 06 '17 at 23:18
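
The separate-file approach from chepner's comment could look roughly like the following in shell. This is only a sketch under assumptions: `process_user`, the `parts` directory, the three sample user IDs, and `n=3` are hypothetical stand-ins for the real processing step and data.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)          # hypothetical scratch directory
parts="$workdir/parts"
mkdir -p "$parts"

# Hypothetical stand-in for the real per-user processing step.
process_user() {
    local user=$1
    for i in 1 2 3 4 5; do
        echo "user $user line $i"
    done
}

# Each worker writes the first n lines of its result to its own part file,
# so no two workers ever write to the same file.
n=3
for user in 1 2 3; do
    process_user "$user" | head -n "$n" > "$parts/$(printf '%06d' "$user").txt" &
done
wait    # every worker must finish before the parts are concatenated

# Zero-padded names make the glob expand in numeric order.
cat "$parts"/*.txt > "$workdir/output"
cat "$workdir/output"
```

Because only one process ever writes `output`, no locking is needed. With 200k part files the glob may exceed the argument-length limit of `cat`, in which case a loop over the sorted file names would be a safer way to concatenate.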

1 Answer


`sem` from GNU Parallel should be able to do it:

sem --id mylock "head -n 256 thread_processed.txt >> output"

This runs the command under a mutex named `mylock`, so only one such command runs at a time and the others wait their turn.

If you are concerned that someone might read `output` while `head` is still running, append to a copy and rename it into place:

sem --id mylock "cp output o2; head -n 256 thread_processed.txt >> o2; mv o2 output"
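
Where GNU Parallel is not installed, the same kind of mutual exclusion can probably be had with `flock` from util-linux. Note that this swaps in a different tool from the `sem` command above. The sketch below assumes `flock` is available; the lock file path, the `append_head` helper, and the four simulated workers are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)              # hypothetical scratch directory
output="$workdir/output"
lock="$workdir/output.lock"       # hypothetical lock file, shared by all writers
: > "$output"

# Append the first 3 lines of a file to the shared output.
# flock holds an exclusive lock on $lock for as long as the command runs,
# so each worker's lines land in the output as one contiguous block.
append_head() {
    local src=$1
    flock "$lock" sh -c 'head -n "$1" "$2" >> "$3"' sh 3 "$src" "$output"
}

# Simulate four workers that have each produced a result file.
for w in 1 2 3 4; do
    for i in 1 2 3 4 5; do
        echo "worker $w line $i"
    done > "$workdir/thread_$w.txt"
done

# Run the appends concurrently, then wait for all of them.
for w in 1 2 3 4; do
    append_head "$workdir/thread_$w.txt" &
done
wait

cat "$output"
```

The order in which the workers' blocks appear depends on which one takes the lock first, so this guarantees non-overlapping blocks but not a particular ordering between workers.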
Ole Tange