I have a file called "align_summary.txt" which looks like this:

Left reads:
Input     :  26410324
   Mapped   :  21366875 (80.9% of input)
   of these:    451504 ( 2.1%) have multiple alignments (4372 have >20)

...more text....

... and several more lines of text....

I want to extract, in a bash shell, the percentage of mapped left reads that have multiple alignments (2.1 in this case).

If I use this:

 pcregrep -M "Left reads.\n..+.\n.\s+Mapped.+.\n.\s+of these" align_summary.txt | awk -F"\\\( " '{print $2}' | awk -F"%" '{print $1}' | sed -n 4p

It promptly gives me the output: 2.1

However, if I enclose the same expression in backticks like this:

leftmultiple=`pcregrep -M "Left reads.\n..+.\n.\s+Mapped.+.\n.\s+of these" align_summary.txt | awk -F"\\\( " '{print $2}' | awk -F"%" '{print $1}' | sed -n 4p`

I receive an error:

awk: syntax error in regular expression (  at 
  input record number 1, file 
  source line number 1

As I understand it, enclosing this expression in backticks changes how the regular expression containing the "(" symbol is interpreted, even though the "(" is escaped with backslashes.

Why does this happen, and how can I avoid this error?

I would be grateful for any input and suggestions.

Many thanks,

D. Kazmin

2 Answers

Just use awk:

leftmultiple=$(awk '/these:.*multiple/{sub(" ","",$2);print $2}' FS='[(%]' align_summary.txt )
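
If align_summary.txt also has a matching block for right reads (an assumption; the question only shows the left block), the one-liner above would print one value per block. Here is a minimal sketch that only looks inside the "Left reads:" block and should work in any POSIX awk. The sample file is hypothetical and its "Right reads" numbers are made up for the demonstration:

```shell
# Hypothetical sample input; the "Right reads" numbers are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Left reads:
          Input     :  26410324
           Mapped   :  21366875 (80.9% of input)
            of these:    451504 ( 2.1%) have multiple alignments (4372 have >20)
Right reads:
          Input     :  26410324
           Mapped   :  20000000 (75.7% of input)
            of these:    400000 ( 2.0%) have multiple alignments (4000 have >20)
EOF

leftmultiple=$(awk '
  /^Left reads:/  { in_left = 1; next }   # entering the left block
  /reads:/        { in_left = 0 }         # any other "... reads:" header ends it
  in_left && /of these:/ {
    # grab "( 2.1%" and strip everything except the number
    if (match($0, /\( *[0-9.]+%/)) {
      pct = substr($0, RSTART, RLENGTH)
      gsub(/[( %]/, "", pct)
      print pct
      exit
    }
  }
' "$sample")

echo "$leftmultiple"
rm -f "$sample"
```

For the sample above this prints 2.1. With the real file, replace "$sample" with align_summary.txt and drop the mktemp/cat/rm lines.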
Juan Diego Godoy Robles

Always use $(...) instead of backticks, but more importantly, just use awk alone:

$ leftmultiple=$( gawk -v RS='^$' 'match($0,/Left reads.\s*\n\s+.+\n\s+Mapped.+.\n.\s+of these[^(]+[(]\s*([^)%]+)/,a) { print a[1] }' align_summary.txt )
$ echo "$leftmultiple"
2.1

The above uses GNU awk 4.* and assumes you do need the complicated regexp that you were using to avoid false matches elsewhere in your input file. If that's not the case then the script can of course get much simpler.
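
As for why the backtick version fails: inside backticks the shell removes one level of backslashes (those in front of \, $ and `) before it parses the command, so awk is handed a different -F string than it gets at the prompt. A small sketch that prints what the argument becomes in each form (printf stands in for awk here, purely to make the string visible):

```shell
# Same quoted string, two kinds of command substitution.
plain=$(printf '%s' "\\\( ")
ticks=`printf '%s' "\\\( "`

printf 'with $(...):    [%s]\n' "$plain"   # [\\( ]
printf 'with backticks: [%s]\n' "$ticks"   # [\( ]
```

With $(...) awk is handed \\( followed by a space, the same string it gets at the prompt. With backticks it is handed \( followed by a space, one backslash short. What awk does with that string depends on the implementation, but the error in the question (a bare ( reported as the regexp) suggests the remaining backslash was consumed as a string escape before the regexp was compiled. Using $(...) passes the argument through unchanged, so the original pipeline should work as it does at the prompt.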

Ed Morton