Given a directory with a few million files in it, we want to extract some data from those files.
find /dir/ -type f -exec awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' {} + > the_good_stuff.txt
That will never scale, so we introduce xargs.
find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'
This produces valid output no matter how long we run it. Sweet, so let's write it to a file by appending > the_good_stuff_from_xargs.txt onto that command. Except now the file contains mangled lines.
What strikes me is that while the six awk processes that xargs spawns are writing to my terminal as stdout, the data look fine. The moment the output is redirected to the filesystem, the corruption appears.
I've attempted appending the following to the command:
> myfile.txt
>> myfile.txt
| mawk '{print $0}' > myfile.txt
and various other ways of redirecting or otherwise "pooling" the output of xargs before writing it to disk, with the data coming out corrupted in every version.
I'm positive the raw files are not malformed. I'm positive that when viewed in the terminal as stdout, the xargs command produces valid output for up to 10 minutes of staring at it spit out text...
Local disk is an SSD... I'm reading and writing from the same file system.
Why does redirecting the output of
find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'
cause the data to become malformed?
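For anyone who wants to poke at this without a directory of millions of files, here's a self-contained sketch that should reproduce the same symptom (my assumptions: GNU xargs and awk on Linux; the file name and counts are made up, and how much mangling shows up will depend on timing and buffer sizes):
# 60 awk processes, 6 at a time, each printing ~50k short lines into one shared redirect
seq 1 60 | xargs -n1 -P6 awk 'BEGIN { for (i = 0; i < 50000; i++) printf "%s-%06d\n", ARGV[1], i }' > mangled.txt
# count lines that don't look like "N-NNNNNN"; anything non-zero is the same mangling
grep -cEv '^[0-9]+-[0-9]{6}$' mangled.txt
Dropping the redirect and letting the same run hit the terminal should leave every line intact, which matches what I'm seeing with the real data.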
EDIT
I can't currently install unbuffer, but stdbuf -oL -eL makes the command's output line-buffered and so, in theory, should do the same thing.
I've tried both stdbuf xargs cmd and xargs stdbuf cmd; both have resulted in exceedingly broken lines.
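Spelled out against the pipeline from above, the two placements I mean look like this:
find /dir/ -type f -print0 | stdbuf -oL -eL xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' >> myfile.txt
find /dir/ -type f -print0 | xargs -0 -n1 -P6 stdbuf -oL -eL awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' >> myfile.txt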
The -P6 is required in order for this command to complete in any reasonable amount of time.
EDIT 2
To clarify... xargs and its -P6 flag are requirements for solving the problem, because the directory we are working in has millions of files that must be scanned.
Obviously we could remove -P6 or otherwise stop running multiple jobs at once, but that neither answers the question of why the output is getting mangled nor offers a realistic way to get the output back to a "correct" state while still accomplishing the task at scale.
Solution
The accepted answer mentioned using parallel, which worked the best out of all the answers. The final command I ran looked like this:
time find -L /dir/ -type f -mtime -30 -print0 | parallel -0 -X awk -f manual.awk > the_good_stuff.txt
Awk was being difficult, so I moved the -F"|" into the script (manual.awk) itself. By default parallel will spin up one job per core on the box; you can use -j to set the number of jobs lower if need be.
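For reference, here's roughly what manual.awk amounts to; the exact script isn't reproduced here, but it's just the earlier filter with the field separator moved into a BEGIN block:
cat > manual.awk <<'EOF'
BEGIN { FS = "|" }                  # replaces the -F"|" from the one-liner
$2 ~ /string/ { print $3 "|" $7 }   # same filter as the original command
EOF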
In really scientific terms, this was a massive speed increase. What previously took an unmeasured number of hours (likely 6+) was 10% complete after six minutes, so the whole run will likely finish within an hour.
One catch is that you have to make sure the command running under parallel isn't attempting to write to a file itself... that effectively bypasses the output processing that parallel performs on the jobs it runs!
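To make that concrete, a hypothetical contrast (not something from my actual run): the filter should print to stdout and let the single redirect at the end of the pipeline do the writing, rather than redirecting from inside the script, which would have every job appending to the same file behind parallel's back.
# good: print to stdout; parallel groups each job's output before it reaches the file
$2 ~ /string/ { print $3 "|" $7 }
# bad: each job appends to the file on its own, bypassing parallel's output handling
# $2 ~ /string/ { print $3 "|" $7 >> "the_good_stuff.txt" }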
Lastly, without -X, parallel acts similarly to xargs -n1.
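A quick way to see that difference with dummy arguments (the exact batching under -X depends on the number of job slots and the command-line length limit):
printf '%s\0' one two three four | parallel -0 echo        # default: one argument per job, like xargs -n1
printf '%s\0' one two three four | parallel -0 -X echo     # -X: arguments grouped per job, e.g. "echo one two" then "echo three four"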