
I am trying to get a specific attribute from a line that is returned from the join command. My code to gunzip two files (without saving to disk) and then do a join on them is:

join <(gunzip -c fileA.gz) <(gunzip -c fileB.gz) -t $'|'

The -t $'|' is because the *.gz files are delimited by '|' instead of whitespace. I can use:

awk 'BEGIN {FS="|"};{print $1}'

to get the first field of each line, but I'm unsure whether join outputs the matches as a batch or line by line. If it's line by line, how can I pause it to grab that first field and do a comparison (for example, to decide whether to keep reading more lines)?

Any advice is appreciated.

gjw80
  • I do not understand the question. `join` reads the files line by line, that's why you need the input files to be sorted. – choroba Apr 04 '13 at 15:31
  • My question is: how do I stop once I've reached a certain value in the output? – gjw80 Apr 04 '13 at 15:44
  • Well, it depends on what you want to do. Normally, you pipe the output of `join` to something that does the comparison line by line without pausing anything. – choroba Apr 04 '13 at 15:50
  • I want to compare the first attribute of the line against an outside value (a max value variable basically). I know how to get the attribute using awk, just not how to get each line of output... would tee work? Named pipe? Anonymous pipe? – gjw80 Apr 04 '13 at 15:54

1 Answer

marker="foo"
join <(gunzip -c fileA.gz) <(gunzip -c fileB.gz) -t $'|' | awk -F '|' '{print; if ($1=="'"${marker}"'") exit}'

This will output lines until the first field is equal to the value of $marker, then stop.
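
As a hedged alternative to splicing the shell variable into the awk program text, the marker can be handed to awk with `-v`. The sketch below assumes a hypothetical helper name, `print_until_marker`, and uses `printf` sample data in place of the real `join` output:

```shell
# Hypothetical helper: print lines until field 1 equals the marker, then stop.
print_until_marker() {
  # -v passes the shell value to awk as a variable, so spaces or quotes in
  # the marker never become part of the awk program itself.
  awk -F '|' -v marker="$1" '{ print } $1 == marker { exit }'
}

# Sample data standing in for the output of join.
printf '%s\n' 'aaa|1' 'foo bar|2' 'zzz|3' | print_until_marker 'foo bar'
```

With the real data, the `join ... -t $'|'` pipeline would be piped into the helper instead of `printf`. Note that awk interprets backslash escapes in values given with `-v`, so a marker containing backslashes would need extra care.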

If you're looking to output just the line with the marker, use grep:

join <(gunzip -c fileA.gz) <(gunzip -c fileB.gz) -t $'|' | grep "^${marker}|"
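
One caveat with grep is that the marker is treated as a regular expression, so a character such as `.` matches any character. If an exact comparison on the first field is wanted, a sketch along these lines may help (the helper name `match_marker` and the sample data are made up for illustration):

```shell
# Hypothetical helper: print only lines whose first field is exactly the marker.
match_marker() {
  # An awk pattern with no action prints every line for which it is true.
  awk -F '|' -v marker="$1" '$1 == marker'
}

# 'abc|2' would also match the regular expression ^a.c|, but is skipped here.
printf '%s\n' 'a.c|1' 'abc|2' | match_marker 'a.c'
```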

Update:
If your marker is an integer (say, 100) and you're trying to stop at or beyond the marker (i.e. any number >= 100 is a valid marker), use this:

marker=100
join <(gunzip -c fileA.gz) <(gunzip -c fileB.gz) -t $'|' | awk -F '|' '{print; if ($1>='"${marker}"') exit}'
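
If the goal is to keep only lines up to a maximum and drop everything after it, a possible variant checks the condition before printing and forces a numeric comparison by adding 0 to both sides. This is a sketch under assumptions: `trim_at_max` is a hypothetical name, the sample data stands in for the `join` output, and the keys are assumed to arrive in increasing numeric order.

```shell
# Hypothetical helper: pass through lines whose first field is <= max and
# stop at the first line whose first field is greater than max.
trim_at_max() {
  # Adding 0 makes awk compare both sides as numbers rather than as strings.
  awk -F '|' -v max="$1" '$1 + 0 > max + 0 { exit } { print }'
}

# 100 is absent from the data, so the first larger key (101) ends the output.
printf '%s\n' '99|a' '101|b' '102|c' | trim_at_max 100
```

Keep in mind that `join` expects its inputs sorted on the join field as text, and text order differs from numeric order when keys have different widths (for example, "1000" sorts before "99"). In that case an early exit on a numeric threshold could cut off keys that are still wanted, so this approach fits best when the keys are zero-padded or all the same width.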
Sir Athos
  • your first snippet is closer to what I'm looking for, but it doesn't seem to be stopping at the $marker... – gjw80 Apr 05 '13 at 19:33
  • There was an error which made it break if the marker contained spaces. Other than that, you might have set marker incorrectly. Edited answer to fix the space problem. – Sir Athos Apr 05 '13 at 20:03
  • in setting the marker I replaced "marker" in your code with the variable name containing the value to be matched... still does not seem to be working – gjw80 Apr 05 '13 at 20:36
  • You can't just replace "marker", it's supposed to be a variable. Either run this before my code: `marker="text to stop at"` or replace `${marker}` with your text. – Sir Athos Apr 05 '13 at 22:49
  • OK, so I think that actually works for finding the line if it matches. Do you happen to have an idea why, when I change the == to >, I'm getting binary output instead of ASCII? I'd like to change the comparison to 'greater than' as opposed to 'equals' for the exit condition. What happens if a number and a string are compared? – gjw80 Apr 08 '13 at 14:59
  • The == condition will print lines _until_ the condition is met, not just the line that matches the condition. If it's not doing that for you, please post your actual line and we'll take a look at it. – Sir Athos Apr 08 '13 at 16:21
  • Right, but the problem with that is that the value to match isn't necessarily even in the second joined file. For instance, if the value is 100, then I want to stop joining and exit once I read a value in $1 that is greater than 100. 100 may not be in there, but if 101 is, it should exit at that instead. That's why I want to use the greater-than operator. If nothing in $1 is 100 or higher, then it should join the entire file. The primary purpose of this is to trim the data set we're getting from these joined files by setting a max value and only joining on attribute values up to that... – gjw80 Apr 10 '13 at 13:18
  • If you compare integers instead of strings, you should remove the outer quotes around the marker, as in `$1=='"${marker}"'` instead of `$1=="'"${marker}"'"`; I just tried this simple example and it seems to work as intended: `printf '99:aaa\n100:bbb\n101:ccc\n102:ddd\n' | awk -F : '{print; if ($1 >= 100) exit}'` – Sir Athos Apr 11 '13 at 16:39
  • It's very strange... the values all appear correct, but the comparison doesn't seem to take effect... it never exits when the current value is > the max value... – gjw80 Apr 11 '13 at 20:34