I'm trying to match, in field $5, a pattern made up of the word CONCLUSIONS followed by the value of field $2 and then the value of field $3 from the same record.

For example, my_file.txt uses "|" as the field separator:

1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|

So in this example I only want the first record to be printed because it has the word "CONCLUSIONS" followed by substance1 followed by substance2.

This is what I'm trying but it's not working:

awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt

Any help is much appreciated

Hallucigeniak

1 Answer

$ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|

How It Works

  • BEGIN{FS="|";IGNORECASE=1}

    This part is unchanged from the code in the question.

  • $5 ~ "conclusions.*" $2 ".*" $3

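    This is a condition: it is true if $5 matches the regex formed by concatenating four strings: "conclusions.*", $2, ".*", and $3. Because the right-hand side of ~ is a string expression rather than a /.../ constant, awk evaluates $2 and $3 first and then uses the resulting string as the regex.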

    We have specified no action for this condition. Consequently, if the condition is true, awk performs the default action which is to print the line.
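
Note that IGNORECASE is specific to GNU awk (gawk); other awks treat it as an ordinary variable, and the match stays case-sensitive. As a minimal sketch of a portable alternative, assuming it is acceptable to lowercase both sides of the comparison, tolower() can be applied to $5 and to the pieces of the regex. The records are the sample data from the question; the variable name result is only for illustration:

```shell
# Portable variant: lowercase $5 and the regex parts instead of relying on
# gawk's IGNORECASE. The three records are the sample data from the question.
result=$(printf '%s\n' \
  '1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|' \
  '2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|' \
  '3|substance5|substance6|red|Substance5 interacts with substance6...|' |
  awk -F'|' 'tolower($5) ~ ("conclusions.*" tolower($2) ".*" tolower($3))')
echo "$result"
```

Only the first record should be printed. The parentheses around the concatenation are not required, but they make it explicit that the whole string is the regex.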

Simpler Examples

Consider:

$ echo "aa aa" | awk '$2 ~ /$1/'

This prints nothing because awk does not expand variables or field references inside a /.../ regex constant.

Observe that no match is found here either:

$ echo '$1' | awk '$0 ~ /$1/'

There is no match here because, inside a regex, $ matches only at the end of the string. So /$1/ could only match an end of string followed by a 1, which can never occur. To get a match here, we need to escape the dollar sign:

$ echo '$1' | awk '$0 ~ /\$1/'
$1

To build a regex from awk variables, put a string expression, rather than a /.../ constant, on the right-hand side of ~. This is the basis for this answer:

$ echo "aa aa" | awk '$2 ~ $1'
aa aa

This does successfully yield a match.
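
One caveat with this technique: the contents of the field are interpreted as a regex, so any metacharacters in it (such as . or *) keep their special meaning. The sketch below illustrates this with a made-up input and contrasts it with index(), which does a literal substring search. The variable names regex_hit and literal_hit are only for illustration:

```shell
# "a.c" used as a dynamic regex: the dot matches any character, so "abc" matches
# and the line is printed.
regex_hit=$(echo 'a.c abc' | awk '$2 ~ $1')

# index() looks for the literal substring "a.c" in "abc"; it returns 0,
# so the condition is false and nothing is printed.
literal_hit=$(echo 'a.c abc' | awk 'index($2, $1)')

echo "regex:   [$regex_hit]"
echo "literal: [$literal_hit]"
```

If the substance names could contain such characters, a literal search along these lines may be the safer choice, at the cost of losing the regex features used elsewhere in this answer.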

A Further Improvement

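As Ed Morton suggests in the comments, it might be important to insist that the substances match only on whole words. In that case, we can use \\<...\\> to limit substance matches to whole words. These are GNU awk's word-boundary operators \< and \>, with the backslashes doubled because they appear inside a string. Thus: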

awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt

In this way, substance1 will not match substance10.
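
To see the problem that the word anchors solve, here is a small sketch of the unanchored match on a made-up record (not from the question) in which field $5 mentions substance10 rather than substance1. It uses tolower() instead of IGNORECASE so that it behaves the same in any awk; false_hit is a hypothetical variable name:

```shell
# Without word anchors, "substance1" also matches inside "substance10",
# so this record is printed even though substance1 itself never appears.
# The record is a made-up illustration, not data from the question.
false_hit=$(echo '1|substance1|substance2|red|Conclusions: substance10 and substance2 interact|' |
  awk -F'|' 'tolower($5) ~ ("conclusions.*" tolower($2) ".*" tolower($3))')
echo "unanchored match: [$false_hit]"
```

The record is printed here, which is the false positive. With the anchored gawk command above, the same record should be rejected.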

John1024
  • If I understand correctly, this version works while the OP's version did not because an expression between slashes is a [regexp constant](https://www.gnu.org/software/gawk/manual/html_node/Regexp-Usage.html#Regexp-Usage) and therefore the content of `$2` and `$3` was not substituted in and the comparison was against a literal `$2` and `$3`, which did not succeed. – Simon Feb 20 '15 at 03:02
  • @Simon Correct. And, I just added a section to the answer expanding on that. – John1024 Feb 20 '15 at 03:20
  • I would have expected the desired condition to be more like `$5~"\\" && $5~"\\<"$2"\\>" && $5~"\\<"$3"\\"'` so it wouldn't matter if the order was substance1 then substance2 or vice-versa and it wouldn't falsely match when $3="substance1" and $5 contains "substance17". – Ed Morton Feb 20 '15 at 05:13
  • @EdMorton Excellent suggestion on the whole word matches! As for substance order, the OP's problem statement, as I read it, asked for that specific order. His sample code was consistent with that. If he meant otherwise, then your code is the way to go. – John1024 Feb 20 '15 at 05:40