
Given a directory with a few million files in it, we want to extract some data from those files.

find /dir/ -type f | awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' > the_good_stuff.txt

That will never scale, so we introduce xargs.

find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'

This produces valid output no matter how long we run it. Sweet, so let's write it to a file by appending > the_good_stuff_from_xargs.txt onto that command. Except now the file contains mangled lines.
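
For reference, the full command with the redirection appended is:

find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' > the_good_stuff_from_xargs.txt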

What strikes me is that, while viewing the output of the six subprocesses that xargs spawns on stdout in my terminal, the data looks fine. The moment the data is redirected to the filesystem is when the corruption appears.

I've tried appending the following to the command.

> myfile.txt

>> myfile.txt

| mawk '{print $0}' > myfile.txt

And various other ways of redirecting or otherwise "pooling" the output of xargs before writing it to disk, with the data being corrupted in every version.

I'm positive the raw files are not malformed. I'm also positive that, when viewed in the terminal on stdout, the command with xargs produces valid output, even after up to 10 minutes of staring at it spit text...

The local disk is an SSD... I'm reading from and writing to the same filesystem.

Why does redirecting the output of find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }' cause the data to become malformed?

EDIT

I can't currently install unbuffer, but stdbuf -oL -eL makes the command's output line-buffered and so, theoretically, should do the same thing.

I've tried both stdbuf xargs cmd and xargs stdbuf cmd; both have resulted in exceedingly broken lines.
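
To be concrete, the two placements I tried look roughly like this (same find and awk as before):

find /dir/ -type f -print0 | stdbuf -oL -eL xargs -0 -n1 -P6 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'

find /dir/ -type f -print0 | xargs -0 -n1 -P6 stdbuf -oL -eL awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'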

The -P6 is required in order for this command to complete in any reasonable amount of time.

EDIT 2

To clarify... xargs and its -P6 flag are requirements for solving the problem, because the directory we are working in has millions of files that must be scanned.

Obviously we could remove -P6 or in some other fashion stop running multiple jobs at once, but that doesn't really answer the question of why the output is getting mangled, nor is it a realistic way to restore the output to a "correct" state while still accomplishing the task at scale.

Solution

The accepted answer mentioned using parallel, which worked the best out of all the answers.

The final command I ran looked like this:

time find -L /dir/ -type f -mtime -30 -print0 | parallel -0 -X awk -f manual.awk > the_good_stuff.txt

awk was being difficult, so I moved the -F"|" into the script itself. By default parallel will spin up a job per core on the box; you can use -j to set the number of jobs lower if need be.
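
For reference, manual.awk amounts to the same pattern as before with the field separator moved into a BEGIN block, something like:

BEGIN { FS = "|" }
$2 ~ /string/ { print $3 "|" $7 }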

In really scientific terms this was a massive speed increase. What took an unmeasured number of hours (likely 6+) is 10% complete after six minutes, so it will likely finish within an hour.

One catch is that you have to make sure the command running in parallel isn't attempting to write to a file itself... that effectively bypasses the output processing that parallel performs on the jobs it runs!

Lastly, without -X, parallel acts similarly to xargs -n1.

A Brothers
  • Standard output is line-buffered when writing to a terminal, but it's fully-buffered when writing to a pipe or file. – Barmar Dec 28 '16 at 22:18
  • Use the `unbuffer` command that comes with `Expect`. – Barmar Dec 28 '16 at 22:20
  • Remove the `-P6`; that causes 6 asynchronous processes to write at random times to your output, and they write partial lines as the buffer fills, and different processes write different partial lines at different points, etc. If you must use `-P6`, you need to have the 6 processes writing to different files so that they don't trample on each other's output. That in turn may mean running a shell script that runs `awk` and does I/O redirection to a separate file (use `mktemp` perhaps, or base the name on the PID of the script). [A rough sketch of this approach follows these comments.] – Jonathan Leffler Dec 28 '16 at 22:29
  • It sounds like you should use `parallel` instead of `xargs`, since it manages the commands' output to avoid this sort of trouble. See [this previous question](http://stackoverflow.com/questions/32450489/xargs-losing-output-when-redirecting-stdout-to-a-file-in-parallel-mode). – Gordon Davisson Dec 29 '16 at 05:29
  • Definitely you should use GNU `parallel` – Dario Dec 29 '16 at 07:52
  • `parallel -q` quotes the command string so that you could use raw awk `-F"|"` instead of a separate `.awk` file. – webb Dec 29 '16 at 21:17
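
A rough sketch of the separate-output-files idea from Jonathan Leffler's comment (extract.awk is a hypothetical file holding the same program, $2 ~ /string/{ print $3"|"$7 }, and the temp-directory handling is purely illustrative):

outdir=$(mktemp -d)
find /dir/ -type f -print0 |
  xargs -0 -P6 sh -c 'awk -F"|" -f extract.awk "$@" >> "$0/part.$$"' "$outdir"
cat "$outdir"/part.* > the_good_stuff.txt
rm -r "$outdir"

Each wrapper shell appends to its own part file (named after its PID), so no two awk processes interleave writes within the same output stream; the pieces are concatenated once everything has finished.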

2 Answers


The xargs man page mentions this problem: "Please note that it is up to the called processes to properly manage parallel access to shared resources. For example, if more than one of them tries to print to stdout, the output will be produced in an indeterminate order (and very likely mixed up)."

Luckily, there is a way to make this operation an order of magnitude faster and solve the mangling problem at the same time:

find /dir/ -type f -print0 | xargs -0 awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'

Why?

-P6 is shuffling your output, so don't use it. xargs -n1 launches one awk process for each file, whereas without -n1, xargs launches far fewer awk processes, like this:

files | xargs -n1 awk
=>
awk file1
awk file2
...
awk fileN

vs

files | xargs awk
=>
awk file1 file2 ... fileN # or broken into a few awk commands if many files

I ran your code on ~20k text files, each ~20k in size, with and without -n1 -P6:

with -n1 -P6  23.138s
without        3.356s

If you want parallelism without xargs's stdout shuffling, use GNU parallel (also suggested by Gordon Davisson), e.g.:

find /dir/ -type f -print0 | parallel --xargs -0 -q awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'

Note: -q is necessary to quote the command string; otherwise the quotes in -F"|" and around the awk code are stripped when parallel runs them.

parallel saves a bit of time, but not as much as ditching -n1 did:

parallel       1.704s

P.S.: introducing a cat (which Matt does in his answer) is a tiny bit faster than just xargs awk:

xargs awk        3.356s
xargs cat | awk  3.036s
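
For reference, the xargs cat | awk variant timed here would be something like:

find /dir/ -type f -print0 | xargs -0 cat | awk -F"|" '$2 ~ /string/{ print $3"|"$7 }'
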
webb
  • Hmm, I should have ditched the `-n1`; I think that was left over from an attempt to un-mangle the data. I don't mind if the order of the output is mixed. I do mind when the output is "corrupted", as in half a line gets written and the other half gets written onto another line... Either way, I'll attempt your suggestions and report the results. – A Brothers Dec 29 '16 at 15:08

I would just do the following:

cat /${dir}/* | awk -F"|" '$2 ~ /string/{ print $3 "|" $7 }' >> "$(date)".txt

Where the file is named after the date and time in which the process was run.

Matt
  • I could be wrong, but will this break if there is a directory inside of ${dir}? Like the OP has done, using `find -type f` is usually a good way to get only files. It will even find them recursively, which cat and a glob pattern won't do. – diametralpitch Dec 28 '16 at 23:37
  • This answer ignores the requirement that we are attempting to run multiple awk commands in order to increase the speed of selecting out "string" from the files. – A Brothers Dec 29 '16 at 00:50
  • It does not descend into subdirectories (which was not requested). We avoid the need for running find -type f this way. – Matt Dec 29 '16 at 18:16
  • My version took 0.003 s on a test, yours took 0.079 s... so the above is 20x faster for my test. – Matt Dec 29 '16 at 18:20