2

I am trying to filter a file using values in the 8 column >= 10. I am using awk but for some reason it doesn't work. Am I doing something wrong, what am I missing?

head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375

#Extract column 1 and 8 and print if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'

LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355

#As you can see lots of numbers shouldn't be there (ex: 2.82778125 < 10)
halfer
  • 19,824
  • 17
  • 99
  • 186
  • Check https://ideone.com/SzLsIL, it looks fine if you just use `awk '$2 >= 10'` on your data. – Wiktor Stribiżew Dec 31 '20 at 15:22
  • Well, actually, `awk -F"," '/^LQN/ && $8 >= 10 {print $1, $8}' df_TPM.csv` should work for you. See https://ideone.com/LSBRGY – Wiktor Stribiżew Dec 31 '20 at 15:25
  • why so many "pipes"? ```awk -F, '/^LQN/ && $8>=10 {print $1,$8}' df_TPM.csv``` – vgersh99 Dec 31 '20 at 15:26
  • Because I am bad, you are right. But still prints things with values < 10. Can my awk be wrong? 9.99 does work fine, but any number greater than 10 is problematic. – Amaranta_Remedios Dec 31 '20 at 15:29
  • 2
    What if you do `$8+0 >= 10`? `awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv`? – Wiktor Stribiżew Dec 31 '20 at 15:31
  • 1
    @Amaranta_Remedios, works fine for me for your sample file: getting 9 lines out of 10; skipping the one with ```9.92647375```. Do you have ^M-s in the file? – vgersh99 Dec 31 '20 at 15:31
  • @WiktorStribiżew that seems to work. I don't understand what's wrong with my awk or file – Amaranta_Remedios Dec 31 '20 at 15:32
  • 1
    @Amaranta_Remedios, your code works for me, could you please check once `cat -v file` if you have any control M characters in your Input_file once? – RavinderSingh13 Dec 31 '20 at 15:37
  • 2
    @vgersh99 is almost certainly correct and the problem is DOS line endings causing $8 to end in `^M` and so not look like a number to awk. You can verify that by doing a numeric comparison with any other field than the last one on the line. See https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it?noredirect=1#comment97778025_45772525 for how to deal with that. – Ed Morton Dec 31 '20 at 15:55
  • @vgersh99 that was the problem indeed. Solved using the dos2unix – Amaranta_Remedios Dec 31 '20 at 16:02

2 Answers2

2

By seeing OP's comment, in case you don't want to search for LQN text in starting of line and want to check if 8th column is greater than 10 then try following(to check if lines have LQN remove ! from following codes).

awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv

OR To get total lines try: counting those matched lines could be done in a single awk itself.

awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv

Explanation: Adding detailed explanation for above.

awk -F"," '               ##Starting awk program from here.
$8+0 >= 10 && !/^LQN/{    ##Checking condition if 8th field is greater than 10 and NOT LQN.
  count++                 ##Increasing count with 1 here.
}
END{                      ##Starting END block of this awk program from here.
  print count             ##Printing count value here.
}
' df_TPM.csv              ##Mentioning Input_file name here.

To handle control M characters in awk code itself try: considering that you don't want control M characters in your Input_file.

awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
1

You need to tell your awk to coerce $8 into a number by computing $8+0. It is recommended that you ensure you have GNU awk installed to avoid issues. Also, you may probably use dos2unix before working on the files to normalize the line endings.

The whole command can be written as

awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv

See the online awk demo.

NOTE: To only count these lines, use The whole command can be written as

awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv

To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:

awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv

Details

  • -F"," (= -F,) - set the field separator to a comma
  • /^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field is equal or larger than 10
  • !/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is equal or larger than 10
  • {print $1, $8} - print Field 1 and 8
  • {cnt++} - increment the cnt variable
  • END{print cnt} - print the cnt variable once the awk finishes processing lines.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks, I just assumed awk would know that was a number. That was super useful! – Amaranta_Remedios Dec 31 '20 at 15:41
  • Also I had to write it like this ```grep -e "^LQN" df_TPM.csv | awk -F, '$8+0 >= 10 {print $1,$8}'| wc -l``` because i also want to get the ones that don't start with LQN and i know how to do that with grep -v but not with awk – Amaranta_Remedios Dec 31 '20 at 15:42
  • @Amaranta_Remedios I understand `gawk` does understand that since you have seen the [previous test](https://ideone.com/SzLsIL) using your code. You must have some other version. – Wiktor Stribiżew Dec 31 '20 at 15:42
  • Thanks, awk version 20070501 – Amaranta_Remedios Dec 31 '20 at 15:43
  • 1
    With `awk`, you just use `awk -F, '!/^LQN/ && $8+0 >= 10 {print $1, $8}' file | wc -l`. – Wiktor Stribiżew Dec 31 '20 at 15:44
  • Thanks. I didn't know i could use ! in awk too. – Amaranta_Remedios Dec 31 '20 at 15:51
  • 3
    If you need to add `0` to that then the version of awk you are running is broken and you should find a new one because other things may also be broken that you haven't encountered yet. I suspect instead that [@vgersh99 is correct](https://stackoverflow.com/questions/65522666/filter-a-file-using-a-column-value-greater-than-a-number-awk-not-working#comment115842223_65522666) and you have control-Ms in your file and doing the `+0` is just working around that problem manifesting because the number you're interested in is at the end of the line. – Ed Morton Dec 31 '20 at 15:53