How to add the filename as a column in awk output?

Question

I am trying to grep the contents of several files in a directory and append the matches to a single file. In the output I also want a column holding the filename, so I can tell which file each entry came from. I tried to use awk for this, but it did not work.

for i in *_2.5kb.txt; do more $i | grep "NM_001080771" | echo `basename $i` | awk -F'[_.]' '{print $1"_"$2}' | head >> prom_genes_2.5kb.txt; done

The file names look like this (I have around 50 files):

    48hrs_CT_merged_peaks_2.5kb.txt
    48hrs_TAMO_merged_peaks_2.5kb.txt
    72hrs_TAMO_merged_peaks_2.5kb.txt
    72hrs_CT_merged_peaks_2.5kb.txt
    5D_CT_merged_peaks_2.5kb.txt
    5D_TAMO_merged_peaks_2.5kb.txt

Each file contains several lines:

chr1    3663275 3663483 14  2.55788 2.99631 1.40767 NM_001011874    -
chr1    4481687 4488063 264 7.85098 28.25170    26.41094    NM_011441   -
chr1    5008006 5013929 243 8.20677 26.17854    24.37907    NM_021374   -
chr1    5578362 5579949 65  3.48568 7.83501 6.57570 NM_011011   +
chr1    5905702 5908002 148 5.84647 16.53171    14.88463    NM_010342   -
chr1    9288507 9290352 77  4.04459 9.12442 7.77642 NM_027671   -
chr1    9291742 9292528 142 5.74749 16.21792    14.28185    NM_027671   -
chr1    9535689 9536176 72  4.45286 8.82567 7.29563 NM_021511   +
chr1    9535689 9536176 72  4.45286 8.82567 7.29563 NM_175236   +
chr1    9535689 9536176 72  4.45286 8.82567 7.29563 NR_027664   +

When a line matches "NM_001080771", I print the entire line to a new file. This is done for each input file, and the matches are appended to one output file. I also want to add a column with the filename, as in the desired output below, so that I know which file each entry came from.

Desired output:

chr4    21610972    21618492    193 7.28409 21.01724    19.35525    NM_001080771    -   48hrs_CT
chr4    21605096    21618696    76  4.22442 9.32981 7.68131 NM_001080771    -   48hrs_TAMO
chr4    21604864    21618713    12  1.78194 2.36793 1.25883 NM_001080771    -   72hrs_CT
chr4    21610305    21615717    26  2.90579 4.47333 2.65353 NM_001080771    -   72hrs_TAMO
chr4    21609924    21618600    23  2.63778 4.0642  2.33685 NM_001080771    -   5D_CT
chr4    21609936    21618680    30  5.63778 3.0642  8.33685 NM_001080771    -   5D_TAMO

This is not working. I basically want to append a column holding the filename, as either the first or the last column. How can I do that?

You need to show us a sample of your input and the desired output, otherwise it's difficult for us to help you. — Tom Fenech, Mar 03 '16 at 15:17

karakfa · Answer 1 · 2016-03-03T15:39:21.980


Or you can do it all in awk:

 awk '/NM_001080771/ {print $0, FILENAME}' *_2.5kb.txt

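This trims the filename to the desired format (working on a copy, so awk's built-in FILENAME is left untouched, and with the dots escaped so they match literally):

$ awk '/NM_001080771/ {tag = FILENAME; sub(/_merged_peaks_2\.5kb\.txt$/, "", tag);
                       print $0, tag}' *_2.5kb.txt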
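
A small, self-contained sketch of the same awk-only idea, for anyone who wants to try it without the real data. The scratch directory, the two shortened sample files, and the variable names `demo_dir`, `tagged`, and `tag` are assumptions made for illustration; only the pattern and the file-name suffix come from the question. It computes the trimmed name once per file (on `FNR == 1`) rather than on every matching line, and uses a tab as the output separator to keep the column layout.

```shell
#!/bin/sh
# Hypothetical demo data: two tiny files in a scratch directory.
demo_dir=$(mktemp -d)
printf 'chr4\t21610972\t21618492\tNM_001080771\t-\n' > "$demo_dir/48hrs_CT_merged_peaks_2.5kb.txt"
printf 'chr1\t3663275\t3663483\tNM_001011874\t-\n' > "$demo_dir/48hrs_TAMO_merged_peaks_2.5kb.txt"

# FNR == 1 is true on the first line of each input file, so the tag is
# computed once per file instead of once per matching line.
tagged=$(cd "$demo_dir" && awk -v OFS='\t' '
    FNR == 1 { tag = FILENAME; sub(/_merged_peaks_2\.5kb\.txt$/, "", tag) }
    /NM_001080771/ { print $0, tag }
' *_2.5kb.txt)

printf '%s\n' "$tagged"
rm -rf "$demo_dir"
```

Only the first sample file contains the pattern, so a single line tagged `48hrs_CT` is printed.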

edited Mar 03 '16 at 15:39

answered Mar 03 '16 at 15:33


It works with the current number of files, but it prints the entire filename; I was trying to add just a subset of the name, like `48hrs_TAMO`. It somewhat serves the purpose, and I can make do with it for now. Thanks everyone. Since it is not the exact answer, I am not accepting it. Is that fine? – ivivek_ngs Mar 03 '16 at 15:48
It's fine but did you check the second script which removes the suffix? – karakfa Mar 03 '16 at 15:57
Ah yes, I am so sorry, the second one also works totally fine. Yes, I accept it now. Thanks a lot. I have put it in a script so that the operation can also be used on a large number of files. – ivivek_ngs Mar 03 '16 at 16:00
@vchris_ngs this is THE right answer to your question. – Ed Morton Mar 03 '16 at 17:26
Until `*_2.5kb.txt` expands too large for your shell to handle (`/usr/bin/awk: Argument list too long.`), that is... – John Hascall Mar 03 '16 at 17:34
I feel both answers are correct, but the system does not allow me to accept both. John is correct that if there are too many files for the shell to handle, it is important to run it as a script as he showed; otherwise @karakfa's second answer is also correct. – ivivek_ngs Mar 03 '16 at 18:02

John Hascall · Accepted Answer · 2016-03-03T17:27:00.220

As long as the number of files is not huge, why not just:

grep NM_001080771 *_2.5kb.txt | awk -F: '{print $2,$1}'

If you have too many files for that to work, here's a script-based approach that uses awk to append the filename:

#!/bin/sh
for i in *_2.5kb.txt; do
    < $i grep "NM_001080771" | \
        awk -v where=`basename $i` '{print $0,where}'
done

./thatscript | head > prom_genes_2.5kb.txt

Here we are using awk's -v VAR=VALUE command line feature to pass in the filename (because we are using stdin we don't have anything useful in awk's built-in FILENAME variable).
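
To see the `-v` mechanism in isolation, here is a minimal sketch. The path in `i` and the one-line input are made up for the demonstration; `where` is the same awk variable name used in the script above. It uses `$(...)` with quotes instead of backticks, which behaves the same for these file names but also survives names containing spaces.

```shell
#!/bin/sh
# Hypothetical file name standing in for the loop variable $i.
i=./data/48hrs_CT_merged_peaks_2.5kb.txt

# awk reads stdin here, so the name must be handed in explicitly with -v.
line=$(printf 'chr4 NM_001080771 -\n' |
    awk -v where="$(basename "$i")" '{ print $0, where }')

printf '%s\n' "$line"
```

One caveat worth knowing: awk interprets backslash escape sequences in a `-v` value, so a file name containing a backslash would not be passed through literally.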

You can also use such a loop around @karakfa's elegant awk-only approach:

#!/bin/sh
for i in *_2.5kb.txt; do
    awk '/NM_001080771/ {print $0, FILENAME}' $i
done

And finally, here's a version with the desired filename munging:

#!/bin/sh
for i in *_2.5kb.txt; do
      awk -v TAG=${i%_merged_peaks_2.5kb.txt} '/NM_001080771/ {print $0, TAG}' $i
done

(this uses the shell's variable substitution ${variable%pattern} to trim pattern from the end of variable)
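
The suffix trimming can be tried on its own. This is a short sketch using one of the file names from the question; the variable names `tag` and `short` are just for illustration. `%` removes the shortest matching suffix and `%%` the longest, and the pattern is a shell glob, not a regular expression, so the dots are literal.

```shell
#!/bin/sh
i=48hrs_CT_merged_peaks_2.5kb.txt

# Remove the exact, known suffix.
tag=${i%_merged_peaks_2.5kb.txt}

# Alternative: remove everything from "_merged" onward using a glob.
short=${i%%_merged*}

printf '%s %s\n' "$tag" "$short"
```

Both expansions give `48hrs_CT` for this name.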

Bonus

Guessing you might want to search for other strings in the future, so why don't we pass in the search string like so:

#!/bin/sh
what=${1?Need search string}
for i in *_2.5kb.txt; do
  awk -v TAG=${i%_merged_peaks_2.5kb.txt} /${what}/' {print $0, TAG}' $i
done

./thatscript NM_001080771 | head > prom_genes_2.5kb.txt

YET ANOTHER EDIT

Or if you have a pathological need to over-complicate and pedantically quote things, even in 5-line "throwaway" scripts:

#!/bin/bash
# bash rather than sh: shopt is a bash builtin
shopt -s nullglob

what="${1?Need search string}"
filematch="*_2.5kb.txt"
trimsuffix="_merged_peaks_2.5kb.txt"

for filename in $filematch; do
    awk -v tag="${filename%${trimsuffix}}" \
        -v what="${what}" \
        '$0 ~ what {print $0, tag}' "$filename"
done
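
If the "Argument list too long" limit is the real concern, one alternative to a shell loop is to let `find` hand the file names to awk in batches. The following is only a sketch under assumptions: `tag_matches` is a hypothetical helper name, `-maxdepth` is a GNU/BSD `find` option rather than POSIX, and the `-name` pattern is narrowed to `*_merged_peaks_2.5kb.txt` so that an output file such as `prom_genes_2.5kb.txt` is not read back in as input.

```shell
#!/bin/sh
# Sketch: find batches the file names, so the shell never expands a huge glob.
# tag_matches is a hypothetical helper; $1 = search string, $2 = directory.
tag_matches() {
    find "$2" -maxdepth 1 -type f -name '*_merged_peaks_2.5kb.txt' \
        -exec awk -v what="$1" '
            FNR == 1 {
                tag = FILENAME
                sub(/^.*\//, "", tag)                      # drop the directory part
                sub(/_merged_peaks_2\.5kb\.txt$/, "", tag) # drop the common suffix
            }
            $0 ~ what { print $0, tag }
        ' {} +
}
```

It could be called as `tag_matches NM_001080771 . > prom_genes_2.5kb.txt`. `find` does not promise any particular file order, so pipe the result through `sort` if the order matters.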

the shell script works fine since I have around 50 files but I would like to use it also later for lets say 1k files in a directory. Only thing is it prints the entire filename I was looking for a part of the file name — ivivek_ngs, Mar 03 '16 at 15:39
@vchris_ngs OK, the final example includes the filename chopping you want. — John Hascall, Mar 03 '16 at 15:50
sorry works both the answers but I cannot accept both, but the second answer for karakfa also works so if John can accept is as well. It seems I am not entitled to accept both as answers. — ivivek_ngs, Mar 03 '16 at 16:02
The above is the wrong approach and so will be slow and buggy because of that. Never write a shell loop just to manipulate text. Always quote your shell variables. Don't use deprecated backticks. Never allow shell variables to expand to become part of an awk script. Don't use all-upper case variable names in awk. — Ed Morton, Mar 03 '16 at 16:17
