I am working on a project that requires me to take several .bed files as input, extract one column from each file, keep only the values that satisfy a condition, and count how many such values each file contains. I am very inexperienced with bash, so I don't know most of the commands, but this line of code does the trick:

for FILE in *; do cat $FILE | awk '$9>1.3'| wc -l ; done>/home/parallels/Desktop/EP_Cell_Type.xls

I saved those values in a .xls file since I need to make some graphs from them. Now I would like to list the filenames with ls and save them in the first column of my .xls file, with the counts in the second column. So far I have only managed to save everything in one column, with the command:

ls>/home/parallels/Desktop/EP_Cell_Type.xls | for FILE in *; do cat $FILE | awk '$9>1.3'-x| wc -l ; done >>/home/parallels/Desktop/EP_Cell_Type.xls

My sample files are: A549.bed, GM12878.bed, H1.bed, HeLa-S3.bed, HepG2.bed, Ishikawa.bed, K562.bed, MCF-7.bed, SK-N-SH.bed. They are in a folder that contains only those files.

The output is the list of all filenames followed by all the values, in the same column, like this:

Column 1
A549.bed
GM12878.bed
H1.bed
HeLa-S3.bed
HepG2.bed
Ishikawa.bed
K562.bed
MCF-7.bed
SK-N-SH.bed
4536
8846
6754
14880
25440
14905
22721
8760
28286

but what I need is something like this:

Filenames #BS
A549.bed 4536
GM12878.bed 8846
H1.bed 6754
HeLa-S3.bed 14880
HepG2.bed 25440
Ishikawa.bed 14905
K562.bed 22721
MCF-7.bed 8760
SK-N-SH.bed 28286
  • If you already have an existing file produced by both commands, then you can just run `pr -t2 -s' ' /home/parallels/Desktop/EP_Cell_Type.xls`. Although your code is another story to tell :-) – Jetchisel Mar 28 '21 at 11:33
  • when you open the `.xls` file does it automatically load the cells as you've displayed (`Desired Output`) or does the program (eg, `Excel`) ask you for a field delimiter and then loads the data into cells? it seems (to me) that what you want is to generate some sort of delimited output with 2 columns ... 1st column == filename / 2nd column == a count of 'matching' rows ... is this correct? – markp-fuso Mar 28 '21 at 18:09
  • OK, so `pr -t2 -s' ' /home/parallels/Desktop/EP_Cell_Type.xls` solved my problem, but I have to save the same file twice. I hoped it was possible to change the output before it was saved as xls. What do you mean by "my code is another story to tell"? – aleleo97ao Mar 29 '21 at 08:36
  • @markp-fuso you're right, Excel asks me for a field delimiter and then loads the data. My desired output would be as you described: 1st column == filename / 2nd column == a count of 'matching' rows – aleleo97ao Mar 29 '21 at 08:38
  • please update the question with sample input, the (wrong) output generated by your code and the (correct) output you're expecting; do not post this info as images as most of us are not going to take the time to convert images to text that we can use in our testing and answers – markp-fuso Mar 29 '21 at 11:54
  • Sorry, you're right. I'll post the output as text and delete the picture. – aleleo97ao Mar 30 '21 at 15:02

1 Answer

Assuming OP's awk program (correctly) finds all of the desired rows, an easier (and faster) solution can be written completely in awk.

One awk solution that keeps track of the number of matching rows and then prints the filename and line count:

awk '
FNR==1 { if ( count >= 1 )                       # first line of new file? if line counter > 0
             printf "%s\t%d\n", prevFN, count   # then print previous FILENAME + tab + line count
         count=0                                # then reset our line counter
         prevFN=FILENAME                        # and save the current FILENAME for later printing
       }

$9>1.3 { count++ }                              # if field #9 > 1.3 then increment line counter

END    { if ( count >= 1 )                       # flush last FILENAME/line counter to stdout
             printf "%s\t%d\n", prevFN, count
       }
' *                                             # * ==> pass all files as input to awk

For testing purposes I replaced $9>1.3 with /do/ (match any line containing the string 'do') and ran against a directory containing an assortment of scripts and data files. This generated the following tab-delimited output:

bigfile.txt     7
blocker_tree.sql        4
git.bash        2
hist.bash       4
host.bash       2
lines.awk       2
local.sh        3
multi_file.awk  2
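
A loop-based variant, closer to the question's original command, is also possible. The following is a minimal sketch, assuming the row test `$9>1.3` from the question is the right one; the function name `count_matches` and the header text are illustrative choices, not something taken from the original commands. It prints the filename and the count on the same line, so the two-column layout exists before anything is written to disk.

```shell
# count_matches: print "filename<TAB>count" for each file given as argument.
# The function name and the header text are illustrative assumptions.
count_matches() {
    printf 'Filenames\t#BS\n'
    for f in "$@"; do
        # awk counts the matching rows itself, so cat and wc are not needed;
        # n+0 forces a numeric 0 when no row matches
        n=$(awk '$9 > 1.3 { n++ } END { print n+0 }' "$f")
        printf '%s\t%d\n' "$f" "$n"
    done
}

# Usage (run inside the folder that holds the .bed files):
#   count_matches *.bed > /home/parallels/Desktop/EP_Cell_Type.xls
```

Quoting `"$f"` keeps filenames with spaces intact. The result is tab-delimited text, so Excel will still ask for a field delimiter when opening it.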
markp-fuso
  • Update: It works perfectly, thank you kind sir! Now I just need to understand the basics behind it, as it seems like rocket science to me. I don't understand the last two lines: why do you have to do another if ( count >= 1) printf "%s\t%d\n", prevFN, count? Doesn't the first if... print.... already achieve the goal of printing the name and the count? – aleleo97ao Mar 30 '21 at 15:59
  • the first `printf` call is made upon seeing a new input file ... so the `printf` is actually printing the results of the **last** file; the `END` block code is called after the last file has been processed and the `if` test is checking to see if that last file actually had any matching lines (ie, `count >= 1`) ... if `count=0` then we know no lines matched and therefore we do not want to print any info about that file – markp-fuso Mar 30 '21 at 16:10
  • Ok, so I think I understood everything correctly. My last question is, knowing that in every file I'll have at least one or more lines that comply with the if condition, can I shorten the code by removing some lines of the code? – aleleo97ao Mar 30 '21 at 17:22
  • if you know with **100% certainty** that all files will have at least one match then you could remove the 2x clauses - `if ( count >= 1 )`; other than that ... you're welcome to play with and edit/modify the code as you desire; "*When in doubt, try it out!*" – markp-fuso Mar 30 '21 at 17:38
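
One caveat when shortening the program: if the guard in the `FNR==1` block is dropped outright, the first file triggers a `printf` before anything has been counted, which prints a line with an empty name and a count of 0. A minimal sketch of a variant that avoids this by testing `NR>1` instead, and that also reports files with zero matches, could look like this; the function name `count_all` is an illustrative assumption.

```shell
# count_all: like the awk program above, but prints every file, including
# those with zero matches. The function name is an illustrative assumption.
count_all() {
    awk '
    FNR==1 && NR>1 { printf "%s\t%d\n", prevFN, count }            # a later file starts: report the previous one
    FNR==1         { count=0; prevFN=FILENAME }                    # reset the counter for the file that just started
    $9>1.3         { count++ }                                     # same row test as in the question
    END            { if (NR>0) printf "%s\t%d\n", prevFN, count }  # report the last file
    ' "$@"
}

# Usage:
#   count_all *.bed
```

Files that contain no lines at all never reach the `FNR==1` rules, so they are still omitted from the output.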