
I wrote a script that searches a huge CSV file for the fields I want to print, stores them in an array, and then passes that array to awk to print them (thanks to an answer found here):

  awk -F,                                         \
        -v _sourcefile="$i"                         \
        -v title="\"${k}\""                         \
        -v box="_${j}_"                             \
        -v score="$dock_score_column"               \
        -v xp_terms_columns="${xp_terms_columns[*]}" \
    '
        BEGIN {
            nxp  = split(xp_terms_columns,xp," ")
            nfmt = split("%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s",fmt," ")         
            }
        ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
         if ( $xp[6] ~ / / ) {         # <-- Here's the problem
             printf "%-8s", $score
             exit 0
         }
         else {
        }
        ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) 
    {
            printf "%-8s =", $score
            for ( i=1; i<=nxp; i++ ) {
                printf ("%s" fmt[i]), OFS, $(xp[i])
            }
            print ""
        }
    ' "$file"

The if statement is supposed to prevent empty or blank fields from being printed: it should jump straight to printing $score and skip the else branch. To make this easier to follow, here is some test output:

    #Desired output 
         XP HBond + Electro + PhobEn + PhobEnHB + LowMW + RotPenal + LipophilicEvdW + PhobEnPairHB + Sitemap + Penalties + PiStack + HBPenal + ExposPenal + PiCat + ClBr + Zpotr

-9,473 =   -0,953    -1,133   -2,700      0,000   0,000      0,211           -3,298          0,000    -1,600       0,000     0,000     0,000        0,000   0,000  0,000   0,000
    #Desired output if one or more indexes are blanks
         XP HBond + Electro + PhobEn + PhobEnHB + LowMW + RotPenal + LipophilicEvdW + PhobEnPairHB + Sitemap + Penalties + PiStack + HBPenal + ExposPenal + PiCat + ClBr + Zpotr
-9,473

The first output is what the else branch should print IF all the elements in xp exist. Unfortunately, if the CSV has the fields but the values in the records are empty (i.e. "") the script prints something like:

         XP HBond + Electro + PhobEn + PhobEnHB + LowMW + RotPenal + LipophilicEvdW + PhobEnPairHB + Sitemap + Penalties + PiStack + HBPenal + ExposPenal + PiCat + ClBr + Zpotr
-7,897 =    0,000

Instead I want it to print only the first number, stored in $score. How can I check whether ALL of the array's indexes/values exist and are not blank/empty?

EDIT

To make it simpler to explain my problem, i've made a (fake) $file that can be used for testing out possible solutions:

Title,Score,Score-1,Score-2,Score-3,Score-4
foo,4.9,1.2,,,,

This kind of file gives me trouble: although no value is specified in record 2 for fields 4, 5 or 6, awk still prints out an (empty) line and stores it inside the array xp. By modifying the printf format I found that the element in the array is empty but still populated. Checking nxp also reports 5 different elements, even though only 2 numbers appear in the example file.

EDIT-2

I wish my real file were this short and simple. I wrote an entirely separate script whose sole purpose is to filter it and find the fields in a specific order, so there is no way to simply print from field a to field b without a problem.

Gioele
  • @Fravadona How do I tell awk to check if xp[index_whatever] is empty/blank? – Gioele Mar 15 '23 at 16:53
  • Check this `awk 'BEGIN{a[1]=1; if(a[2] == ""){print "empty"}}'`. Also see here https://stackoverflow.com/questions/11952854/how-to-check-if-the-variable-value-in-awk-script-is-null-or-empty – Andre Wildberg Mar 15 '23 at 17:02
  • Please reduce this to a [mcve] just about the problem you're asking for help with. We don't need/want to see 30 lines (or whatever) of awk for this, just a tiny 1 or 2 line script about THIS problem. Also make it clear if you want to test for the array index being unpopulated or the index populated with a null string or some other value. – Ed Morton Mar 15 '23 at 17:05
  • @AndreWildberg [that](https://stackoverflow.com/questions/75747534/bash-awk-check-an-awk-array-with-an-if-else-statement-to-see-if-one-or-more-el#comment133624089_75747534) would **create** array elements when none were previously present so it's unlikely to be what the OP should use no matter what they're trying to do. The question you linked also doesn't have a good answer for this. – Ed Morton Mar 15 '23 at 17:09
  • `var ~ /^[ \t]*$/` would be true if the field is empty or filled with whitespace. – tripleee Mar 15 '23 at 17:31
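
The distinction raised in the comments above (an index that was never populated versus one populated with an empty or whitespace-only string) can be illustrated with a small sketch. The function name `check_array` and the array contents are made up for the demonstration; they do not come from the asker's script.

```shell
#!/bin/sh
# Hypothetical demo: three ways of asking "is this array element empty?" in awk.
check_array() {
    awk 'BEGIN {
        a[1] = "x"                         # only index 1 holds real content
        a[3] = "   "                       # index 3 holds whitespace only

        # "in" tests for existence without creating the element
        print ((2 in a) ? "2 exists" : "2 missing")

        # comparing a[2] to "" references it, which creates it as a side effect
        if (a[2] == "") print "2 is empty"
        print ((2 in a) ? "2 exists now" : "2 still missing")

        # the regex treats empty and whitespace-only values alike
        print ((a[3] ~ /^[ \t]*$/) ? "3 is blank" : "3 has content")
    }'
}
check_array
```

Note that in the question the values being tested are fields such as `$(xp[i])`, not array elements; referencing an empty field does not create anything, so a plain `== ""` comparison is safe there.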

2 Answers


Assumptions:

  • if any of the $(xp[i]) values evaluate to blank ("") then the only output should be the $score

One idea where we first test all $(xp[i]) values before running the current for/printf code:

# current code:

BEGIN { nxp  = split(xp_terms_columns,xp," ")
        nfmt = split("%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s",fmt," ")         
      }

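($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
    # NOTE: the opening brace must be on the same line as the pattern; on a line
    # of its own, the pattern gets the default action (print) and the block
    # runs for every record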
    # new code:

    missing_vals=0                         # default assumption that all fields are non-blank

    for ( i=1; i<=nxp; i++ ) {             # loop through array
        if ($(xp[i]) == "") {              # if we find a blank value then ...
           missing_vals=1                  # set our flag and ...
           break                           # break out of "for" loop
        }
    } 

    # current code:

    if ( missing_vals ) {                  # if missing_vals==1 ...
       printf "%-8s", $score
       exit 0
    }
}

($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
    printf "%-8s =", $score
    for ( i=1; i<=nxp; i++ ) {
        printf ("%s" fmt[i]), OFS, $(xp[i])
    }
    print ""
}

Instead of looping through the array twice we should be able to obtain the same result with a single pass through the array while building the (potential) output line in a variable.

One idea using a single pass through the array while also reducing some of the redundant code:

BEGIN { nxp  = split(xp_terms_columns,xp," ")
        nfmt = split("%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s",fmt," ")         
      }

# following should replace rest of the current code:

($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) {
    out=""                                                # clear/reset output variable

    for ( i=1; i<=nxp; i++ ) {
        if ($(xp[i]) == "") {                             # if blank then ...
           out=""                                         # clear/reset output variable and ...
           break                                          # break out of loop
        }
        out=out sprintf( ("%s" fmt[i]), OFS, $(xp[i]) )   # continue building output line
    }

    printf "%-8s", $score

    if (out)
        printf " = %s", out

    print ""
}

NOTE: the question does not include a complete set of sample input data so it won't be possible to test this solution for accuracy
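
As a rough check of the single-pass logic, here is a minimal sketch run against the toy file from the question's EDIT. The column numbers, the `" %s"` format and the function name `print_row` are simplifications chosen for this demo, not the asker's real values, and the title/source filters are replaced by simply skipping the header line.

```shell
#!/bin/sh
# Sketch only: simplified column numbers and formats, not the asker's real ones.
print_row() {   # $1 = space-separated list of field numbers to check and print
    printf '%s\n' 'Title,Score,Score-1,Score-2,Score-3,Score-4' 'foo,4.9,1.2,,,,' |
    awk -F, -v score=2 -v cols="$1" '
        BEGIN { n = split(cols, xp, " ") }
        NR > 1 {                                  # skip the header line
            out = ""
            for (i = 1; i <= n; i++) {
                if ($(xp[i]) == "") {             # blank field: discard what was built
                    out = ""
                    break
                }
                out = out sprintf(" %s", $(xp[i]))
            }
            printf "%s", $score
            if (out != "") printf " =%s", out
            print ""
        }'
}
print_row "2 3"      # both fields populated
print_row "3 4"      # field 4 is empty, so only the score is printed
```

With fields 2 and 3 the row prints as `4.9 = 4.9 1.2`; with fields 3 and 4 only `4.9` is printed, because field 4 is empty in the sample record.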

markp-fuso
  • Your solution might be really close to the point, but shouldn't the check happen before reaching the `printf "%-8s ="` step? I put it before even giving it the chance to print this equal sign: since the array is defined in the BEGIN statement, I put the if statement immediately after that (I added a comment to point out where I think the problem is). – Gioele Mar 16 '23 at 15:28
  • @Gioele if it's just a question of whether or not to print the `=` ... you could either break that `printf` into 2 parts (print the `=` after the test); alternatively, move the are-any-fields-blank test up a few lines in the code ... I've updated the answer with one idea for implementing this alternative – markp-fuso Mar 16 '23 at 15:51
  • After moving the { before "for ( i=1; i<=nxp; i++ )" it worked like a charm! For some reason that i can't explain it's considered an action and had to include it into the curly braces. It worked exactly like intended, i just included another conditional to check if nxp < 16 (which is my number of fields) and it did exactly what it was asked to. Thanks! – Gioele Mar 16 '23 at 16:59
  • ah, yeah, I overlooked the formatting in the original question; appears I got quite a bit out of whack, eh; I've taken another jab at the formatting; I've also added an additional answer based on a single pass through the array – markp-fuso Mar 16 '23 at 17:24

@Gioele : it's basically the same logic: take the first character of each field you need (as substr() would) and concatenate them. If every field is non-empty, the length of the resulting string matches the number of fields needed :

xp_terms_columns=( $( jot 127 | rev | shuf | rev | head -n 16 ) )

echo "\n\t ${#xp_terms_columns[@]} | ${xp_terms_columns[*]}\n"

date | gawk -p- -be '

BEGIN {
        format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"
}
($title_column  ~ title)       &&
($source_column ~ _sourcefile) &&
($source_column ~ box) {

    print '"$(

    awk 'sub("^","$", $!(NF=NF))^_' OFS=', $' <<< "$xp_terms_columns[*]" | 

    awk '{ __=sprintf("%*s", NF, _=ORS="")
                 gsub(".","%.1s",__)

       print (NF)"==sprintf(\""(__)"\", "(_=$+_) \
               ") ? sprintf( format, $score, \"= \", "(_)") : $score \"= 0.00\" " }'    
  
     )"' }'
    16 | 75 59 44 87 95 107 8 110 64 29 81 2 16 115 6 97 
0.00
    # gawk profile, created Fri Mar 17 00:14:35 2023

    # BEGIN rule(s)

    BEGIN {
     1      format = "%-8s %s %9s %9s %8s %10s %7s %10s %16s %14s %9s %11s %9s %9s %12s %7s %6s %7s\n"
    }

    # Rule(s)

     1  ($title_column ~ title) && ($source_column ~ _sourcefile) && ($source_column ~ box) { # 1

     1    print 16 == sprintf("%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s",

           $75, $59, $44, $87, $95, $107, $8, $110, $64, $29, $81, $2, $16, $115, $6, $97) ? sprintf(format, $score, "= ",

           $75, $59, $44, $87, $95, $107, $8, $110, $64, $29, $81, $2, $16, $115, $6, $97) : $score "= 0.00"
    }

Additionally, even that 16 == is being dynamically generated, not hard-coded.
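
For readers who find the generated code hard to follow, the underlying idea (concatenate the first character of each needed field and compare the length with the number of fields) can be sketched with an ordinary loop. This swaps the generated `sprintf()` call for `substr()` and `length()`; the field numbers 2, 3, 4 and the function name `all_present` are hypothetical.

```shell
#!/bin/sh
# Sketch: hypothetical field numbers (2, 3, 4); a loop replaces the generated code.
all_present() {   # reads CSV on stdin, prints "complete" or "incomplete" per line
    awk -F, -v cols="2 3 4" '
        BEGIN { n = split(cols, xp, " ") }
        {
            firsts = ""
            for (i = 1; i <= n; i++)
                firsts = firsts substr($(xp[i]), 1, 1)   # adds "" for an empty field
            print (length(firsts) == n ? "complete" : "incomplete")
        }'
}
printf '%s\n' 'foo,4.9,1.2,0.3' 'foo,4.9,1.2,' | all_present
```

The loop costs a few iterations per record, which is the overhead the generated, loop-free version is meant to avoid.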

in general, you can get that chain of %.1s%.1s… dynamically like this :

echo '$13, $82, $126, $76, $52, $123, $88, $60, 
      $73, $22, $86, $99, $126, $76, $52, $123, 
      $88, $60, $73, $22, $86, $99, $61, $75, $68, $45' | 

awk '{ print $(_=(__=NF)-__) ORS $+(NF *= OFS = "%."(++_)"s")^(NF=_+__) }'
$13, $82, $126, $76, $52, $123, $88, $60, $73, $22, $86, $99, $126, $76, $52, $123, $88, $60, $73, $22, $86, $99, $61, $75, $68, $45
%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s%.1s
    # gawk profile, created Fri Mar 17 00:42:00 2023

    # Rule(s)

     1  {
     1      print $(_=(__=NF)-__) ORS $+(NF*=OFS="%." (++_) "s")^(NF=_+__)
    }

NF *= OFS = … : since OFS begins with "%", it converts to zero when treated numerically, so this sets NF = 0 and clears all existing data in the input row. The final NF = _+__ then rebuilds the row out of empty fields, leaving one copy of %.1s per original column.

regarding $+(..)^(..) - the first (..) always yields zero, and the second is always greater than zero. Zero raised to a positive power is zero, so the whole thing becomes $+0, or simply $0, which prints the newly arranged row.

But that whole thing could be further streamlined to just

print $!(_=NF) ORS $+(NF *= OFS = "%."(!!_)"s")^(NF = ++_)
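
For comparison, a plain loop produces the same `%.1s` chain without relying on NF reassignment. This is only a readable sketch of the result, not the technique used in this answer, and the function name `make_chain` is made up.

```shell
#!/bin/sh
# Sketch: build one "%.1s" per input field with an explicit loop.
make_chain() {
    awk '{
        chain = ""
        for (i = 1; i <= NF; i++)      # one format specifier per field
            chain = chain "%.1s"
        print chain
    }'
}
echo '$13, $82, $126' | make_chain
```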

you can even replace all of them with emojis too if you like :

echo '$13, $82, $126, $76, $52, $123, $88, $60, $73, $22, 
      $86, $99, $126, $76, $52, $123, $88, $60, $73, $22, 
      $86, $99, $61, $75, $68, $45' |

mawk '{

   print $!(_=NF) ORS  \
                        \
         $+(  NF *= OFS = "\360\237\230\203")^(NF=++_)

     1  $13, $82, $126, $76, $52, $123, $88, $60, $73, $22, $86, $99, $126, $76, $52, $123, $88, $60, $73, $22, $86, $99, $61, $75, $68, $45

     2  
RARE Kpop Manifesto
  • This solution is so advanced that i have no idea what is going on through most of it, I got the check to see if there's at least one character associated with each index of the awk array, but the way you're checking it confuses me. – Gioele Mar 17 '23 at 13:27
  • @Gioele : I'm simply using `sprintf("%.1s", str0)` in lieu of `substr(str0,1,1)`, but the advantage of `sprintf()` is that I can do them all in one shot instead of calling `substr()` 16 times. `sprintf()` is basically the same as `printf()`, but instead of printing directly, it returns the string that `printf()` would have printed. Unlike `perl`, single underscore `_` is not a reserved variable name - anyone can use it. `awk` is very straightforward - every time there's `+= -= *= /= %=`, that's ALWAYS performing straight up arithmetic…. – RARE Kpop Manifesto Mar 17 '23 at 14:46
  • @Gioele : and having weakly typed variables is a great asset for `awk` - you can have variables recycled for different roles within the same function. Data types are also straightforward: one type for numbers, one for strings, one for arrays, and you don't even have to "declare" the type at any point - a variable takes on whatever type its present value represents. In many tasks, `perl` is slower than `awk` these days, and `python3` is a snail next to `awk` – RARE Kpop Manifesto Mar 17 '23 at 14:56
  • @Gioele : but if the `sprintf()` approach is a bit overwhelming, you can always do something like: `( $23 != "" ) * ( $29 != "" ) * ( $31 != "" ) …..` - the `!=` returns a boolean result of `0` for false or `1` for true, meaning, to ensure every single field exists, simply multiply the booleans together (essentially `&&` nonstop but cleaner). The result will also be 0 or 1. Whenever there's at least 1 field you need that doesn't have any values in it, the whole multiplication chain becomes 0. – RARE Kpop Manifesto Mar 17 '23 at 15:00
  • Ok i think i can work out what is happening in the script, just a question: why call awk inside another awk (didn't even know that was possible), isn't it simpler to output it into a variable/array and declare it in a single awk? – Gioele Mar 17 '23 at 16:49
  • @Gioele : performance - that's why. The tiny upfront cost of the extra `awk` call by the shell is more than paid for since the new code has no loops at all, and all fields hard-coded in. `awk` has no built-in mechanism to do random code `eval()` and execute in the middle of a session, so a substitution approach is the cheapest approach to get some dynamic code in. Since it's a shell substitution, that part must be done upfront, and just once, instead of having to invoke external calls once inside `awk`. That said, it's nothing close to an actual `eval()`, if that's what you're looking for – RARE Kpop Manifesto Mar 17 '23 at 17:53
  • @Gioele : the way I wrote it wasn't `awk` calling another `awk`, but rather, using a shell-level substitution to use the 1st `awk` to dynamically generate code, and the 2nd `awk` to execute it. If `awk` had a true `eval()` one could combine those steps, but alas nothing is perfect in life. – RARE Kpop Manifesto Mar 18 '23 at 17:18
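
The boolean-multiplication check suggested in the comments above can be tried out with a short sketch. The field numbers 2, 3 and 4 and the function name `check_fields` are hypothetical stand-ins for the real columns.

```shell
#!/bin/sh
# Sketch with hypothetical field numbers 2, 3 and 4.
check_fields() {
    awk -F, '{
        # each comparison yields 1 (true) or 0 (false); the product is 1
        # only when every listed field is non-empty
        ok = ($2 != "") * ($3 != "") * ($4 != "")
        print ok
    }'
}
printf '%s\n' 'foo,4.9,1.2,0.3' 'foo,4.9,1.2,' | check_fields
```

The first record prints 1 and the second prints 0, since its fourth field is empty.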