$ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
How It Works
BEGIN{FS="|";IGNORECASE=1}
This part is unchanged from the code in the question.
$5 ~ "conclusions.*" $2 ".*" $3
This is a condition: it is true if $5 matches a regex composed of four strings concatenated together: "conclusions.*", $2, ".*", and $3.
We have specified no action for this condition. Consequently, if the condition is true, awk performs the default action, which is to print the line.
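To make the concatenation concrete, here is a minimal sketch that first prints the regex awk builds for the sample record and then applies it. The shell variable names (line, regex, matched) are only illustrative. Because IGNORECASE is specific to GNU awk, this sketch lowercases $5 with tolower() instead, which assumes that fields 2 and 3 are already lowercase, as they are in the sample.

```shell
line='1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|'

# Show the regex that the concatenation produces for this record.
regex=$(printf '%s\n' "$line" | awk -F'|' '{ print "conclusions.*" $2 ".*" $3 }')
printf '%s\n' "$regex"

# Portable stand-in for IGNORECASE=1: lowercase the text before matching.
# Assumes $2 and $3 are already lowercase (true for this sample).
matched=$(printf '%s\n' "$line" | awk -F'|' 'tolower($5) ~ "conclusions.*" $2 ".*" $3')
printf '%s\n' "$matched"
```

The first command prints conclusions.*substance1.*substance2, and the second prints the record itself because the lowercased fifth field matches that pattern.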
Simpler Examples
Consider:
$ echo "aa aa" | awk '$2 ~ /$1/'
This prints nothing because awk does not substitute variables inside a regex literal (a pattern written between slashes).
Observe that no match is found here either:
$ echo '$1' | awk '$0 ~ /$1/'
There is no match here because, inside a regex, $ matches only at the end of a line. So /$1/ would match only an end of line followed by a 1, which cannot occur. If we want to get a match here, we need to escape the dollar sign:
$ echo '$1' | awk '$0 ~ /\$1/'
$1
To build a regex from awk variables, which is the basis for this answer, use the variable itself as the right-hand side of ~:
$ echo "aa aa" | awk '$2 ~ $1'
aa aa
This does successfully yield a match.
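The three cases can be compared side by side. This is a small sketch with illustrative shell variable names (literal, dynamic, meta); the third case is an added observation rather than something from the question: when a field is used as a dynamic regex, its contents are interpreted as a pattern, not as literal text.

```shell
# Regex literal: $1 is not expanded, so nothing is printed.
literal=$(echo "aa aa" | awk '$2 ~ /$1/')

# Dynamic regex: the value of $1 ("aa") is used as the pattern.
dynamic=$(echo "aa aa" | awk '$2 ~ $1')

# The field value is treated as a regex, so "a." matches "ab".
meta=$(echo "a. ab" | awk '$2 ~ $1')

printf 'literal=[%s]\ndynamic=[%s]\nmeta=[%s]\n' "$literal" "$dynamic" "$meta"
```

If the substance names could contain characters such as . or +, they would need escaping before being used this way.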
A Further Improvement
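As Ed Morton suggests in the comments, it might be important to insist that the substances match only on whole words. In that case, we can use \\<...\\> to limit substance matches to whole words. These are GNU awk word-boundary operators, and the backslashes are doubled because the regex is written as a string. Thus: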
awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt
In this way, substance1 will not match substance10.
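Since \< and \>, like IGNORECASE, are GNU awk extensions, the following is a rough portable sketch of the same whole-word idea for other awks. It is an approximation, not the method from the comments: it treats letters, digits, and underscore as word characters and requires an explicit non-word character on each side of a substance name. The sample records and the names nw, text, and re are hypothetical.

```shell
# Hypothetical sample: record 1 mentions SUBSTANCE10, record 2 mentions SUBSTANCE1.
sample='1|substance1|substance2|red|CONCLUSIONS: SUBSTANCE10 and SUBSTANCE2 in humans|
2|substance1|substance2|red|CONCLUSIONS: SUBSTANCE1 and SUBSTANCE2 in humans|'

result=$(printf '%s\n' "$sample" | awk -F'|' '
    BEGIN { nw = "[^a-z0-9_]" }    # any one non-word character
    {
        text = tolower($5) " "     # trailing space so the last word has a right edge
        re = "conclusions.*" nw tolower($2) nw "(.*" nw ")?" tolower($3) nw
    }
    text ~ re                      # no action: print matching records
')
printf '%s\n' "$result"
```

Only the second record is printed: in the first, substance1 is followed by 0, which is a word character, so the whole-word requirement fails.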