
I would like to use awk to filter rows on a multi-valued column.

My data has two columns delimited by `;`. The second column contains three float values separated by whitespace.

randUni15799:1;0.00 0.00 0.00
randUni1785:1;0.00 0.00 0.00
randUni18335:1;0.00 0.00 0.00
randUni18368:1;223.67 219.17 0.00
randUni18438:1;43.71 38.71 1.52

What I want to achieve is the following: keep all rows where the first and second values of the second column are both greater than 200.

randUni18368:1;223.67 219.17 0.00

Update: With help from the comments, I tried this and it worked:

awk -F ";" '{split($2, a, " "); if (a[1] > 200 && a[2] > 200) print}'
  • @RAREKpopManifesto I have about 5gb of text files to process. – Min Hee Jo Oct 24 '22 at 11:16
  • Welcome to SO. Thank you for asking a relevant question with source data. Advice: Google how to do the following in AWK: 1) [split string](https://www.poftut.com/awk-text-split-delimit-examples/) (your 3rd column), 2) [convert string to float](https://www.unix.com/shell-programming-and-scripting/163304-awk-string-number-conversion.html). Then [build yourself a function](https://www.tutorialspoint.com/awk/awk_user_defined_functions.htm) in awk to evaluate column 3, and print the line if successful. – MyICQ Oct 24 '22 at 11:51
  • @MyICQ Thank you <3 I did `awk -F ";" '{split($2, a, " "); if (a[1] > 200 && a[2] > 200) print}'`. I didn't need to convert to float, though. – Min Hee Jo Oct 24 '22 at 12:03
  • Excellent. May I ask if you are using dot as decimal separator in your locale ? I could imagine this being different in other locales that use comma. – MyICQ Oct 24 '22 at 19:25
  • @MyICQ We do use comma, but all the files are in points. – Min Hee Jo Oct 25 '22 at 18:56
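As a quick sanity check of the locale point raised above (a sketch, assuming a POSIX shell and awk on the sample data): forcing the C locale guarantees that dot-decimal strings such as `223.67` are converted and compared numerically, regardless of the system's decimal separator.

```sh
# Under LC_ALL=C, dot-decimal strings compare numerically as expected,
# so this prints "match" for the sample row.
printf 'randUni18368:1;223.67 219.17 0.00\n' |
  LC_ALL=C awk -F ';' '{ split($2, a, " "); if (a[1] > 200 && a[2] > 200) print "match" }'
```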

2 Answers


One awk idea:

awk -F';' '{ n=split($2,a,/[[:space:]]+/)            # split 2nd field on spaces; place values in array a[]
             if (a[1] > 200 && a[2] > 200)           # if 1st and 2nd array entries > 200 then ...
                print                                # print current line to stdout
           }
' randum.dat

# or as a one-liner

awk -F';' '{ n=split($2,a,/[[:space:]]+/); if (a[1] > 200 && a[2] > 200) print}' randum.dat

# reduced further based on OP's comments/questions:

awk -F';' '{ split($2,a," "); if (a[1] > 200 && a[2] > 200) print}' randum.dat

This generates:

randUni18368:1;223.67 219.17 0.00
markp-fuso
  • Question 1. Why do I have to save it to `n`? Question 2. Is there a difference between `' '` and `/[[:space:]]+/`? – Min Hee Jo Oct 24 '22 at 13:37
  • you can remove `n` since we're dealing with a fixed number of positions; if you know for a fact that the spacing will be exactly one space then, yes, you can replace `[[:space:]]+` with `" "`; the [split() definition](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html) mentions using a regex hence the `/[[:space:]]+/` but you're welcome to use a string in this case; I've updated the answer with the changes (based on previously mentioned assumptions) – markp-fuso Oct 24 '22 at 13:42
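To illustrate the discussion above (a sketch; the double space between the first two values is deliberate): when `split()`'s separator is the single-space string `" "`, awk treats it specially and splits on runs of whitespace, so extra blanks still yield three array elements.

```sh
# A single-space separator is special-cased by awk: runs of blanks
# count as one separator, so n is 3 despite the double space.
printf 'randUni18438:1;43.71  38.71 1.52\n' |
  awk -F ';' '{ n = split($2, a, " "); print n, a[1], a[2] }'
# → 3 43.71 38.71
```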

I would harness GNU AWK for this task in the following way. Let file.txt content be

randUni15799:1;0.00 0.00 0.00
randUni1785:1;0.00 0.00 0.00
randUni18335:1;0.00 0.00 0.00
randUni18368:1;223.67 219.17 0.00
randUni18438:1;43.71 38.71 1.52

then

awk '(gensub(/.*;/,"",1,$1)+0)>200&&$2>200' file.txt

gives output

randUni18368:1;223.67 219.17 0.00

Explanation: I use the gensub function so that anything up to and including the ; is removed from $1 and the result is returned. Observe that gensub does not alter $1 in this process. Adding zero converts the returned string into a number; the filter then checks whether that number is greater than 200 and whether the 2nd field is also greater than 200.

(tested in gawk 4.2.1)

Daweo