
I'm having issues with creating and sorting an array in Bash which takes its contents as lines from a command, takes certain parts of each line and operates on them before appending them to each line in the array.

To clarify, the command "bogoutil -d wordlist.db" gives output in this form:

hello 428 3654 20151116

Except that there's a few million of these lines.

I want to load each line of the command's output into an array, take the absolute value of the first number minus the second, append that value to the line in a new array, and then sort the new array by that new value.

The issue I'm having is that I suspect IFS needs to change to "\n" to put each line of bogoutil output into an array, but then it needs to change again to tokenise the second and third integers in each line. It's hard to work out what my error is so far, because there are well over 10 million lines in the array, but I can tell from the output that it is not what I should be getting: I think it is merely listing each line and not tokenising properly. Generally it runs for a while, prints a ton of output into the shell that is definitely not what I am expecting (I think it's just a few of the tokens, but definitely not all of them) and then prints

sort: cannot read: resultsarray: No such file or directory

Here is what I've written thus far

#!/bin/bash

IFS=$"\n" #set the IFS so it tokenises each line in the command
for i in $( bogoutil -d wordlist.db )
    do 
            echo $i
            OUTPUT=( ${i// \n} ) #swap out space for a newline so i can
                                 #tokenise by spaces
            BAD=${OUTPUT[1]}
            echo $BAD
            GOOD=${OUTPUT[2]}
            echo $GOOD
            DIFF=$GOOD-$BAD
            echo $DIFF
            if [ "$DIFF" -lt "0" ]
            then
                    DIFF=$DIFF \* -1
            fi
            NEWOUT="$OUTPUT $DIFF" #append the abs of the difference to
                                   #the line so i can sort by it
            resultsarray[i]=$NEWOUT
    done

sort -t " " -k 5 -g resultsarray

echo "${resultsarray[@]:0:10}"

Any assistance would be greatly appreciated. I'm really stumped here and not sure why it's not working. I suspect it's something to do with the way I'm trying to tokenise each line of output, but I'm not sure. The other possibility (given that it lists tokens for a while and then just stops) is that there are just too many elements in the array and it runs out of allocated space. Is that a possibility?

Thanks in advance, any help you can provide is much appreciated.

EDIT: To clarify expected input and output.

A sample input would be

hello 4 1 20151116
goodbye 0 256 20151116
grant 428 3654 20151116

An expected output for that would be

grant 428 3654 20151116 3226
goodbye 0 256 20151116 256
hello 4 1 20151116 3

As you can see, it's sorted by the absolute value of the difference between the first and second number. There are no negatives in the dataset; the lowest is 0.

EDIT: the awk solution below works great! I'm not sure how one would do this with Bash, but I suspect Bash isn't the right way to go about it and it's probably better to use awk anyway. Thanks for all the help, it was very much appreciated!

g.grinovski
  • Are you open to an alternate solution to "take the absolute value of the first number minus the second, append that value onto the line in a new array, and then sort"? Based on that description, you have described a very normal Unix pipeline like `yourcmd | awk '{do calcs; print output}' | sort`. Good luck. – shellter Nov 17 '15 at 22:51
  • See http://mywiki.wooledge.org/BashFAQ/001 and http://mywiki.wooledge.org/DontReadLinesWithFor – glenn jackman Nov 17 '15 at 22:52
  • @shellter I am definitely open to an alternative solution. My question is how would i do the calculations, given I am sorting by a derived value and not any contents of the array itself? thanks I'll try it with while instead of for – g.grinovski Nov 17 '15 at 23:01
  • I haven't read your entire script so I won't talk about possible alternatives like @shellter suggested. Only regarding your last two commands `sort` and `echo`: `sort` is an external command, and doesn't sort a bash array. You need to pipe stuff into stdin and get output on stdout. Although I don't think it's a good idea, what you want to achieve there (if I understood you correctly) can be done by `IFS=$'\n'; sort -t " " -k 5 -g <<< "${resultsarray[*]}" | head -n 10`. – 4ae1e1 Nov 17 '15 at 23:06
  • Oh, backtracing a little bit, `resultsarray[i]=$NEWOUT` is a pretty ridiculous assignment. Your `i` is not an integer, you assigned it to a line each time. – 4ae1e1 Nov 17 '15 at 23:09
  • 4ae1e1 thanks i replaced sort with your line and it removed the error but it merely returned an empty line so I suspect the error is with tokenisation. – g.grinovski Nov 17 '15 at 23:18
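
For reference, the Bash-side approach hinted at in the comments above (read lines with a while loop, then pipe into sort) might look like the sketch below. It is only an illustration under some assumptions: sample_input is a hypothetical stand-in for bogoutil -d wordlist.db, the variable names word, bad, good and date are made up for the four columns, and the sort is descending to match the expected output in the question. No array is needed, because sort reads a stream rather than a Bash variable.

```shell
#!/bin/bash
# Hypothetical stand-in for `bogoutil -d wordlist.db` (sample data from the question).
sample_input() {
    printf '%s\n' \
        'hello 4 1 20151116' \
        'goodbye 0 256 20151116' \
        'grant 428 3654 20151116'
}

# `read` takes one line per iteration and splits it on whitespace into the
# named variables, so IFS never has to be switched back and forth.
append_abs_diff() {
    while read -r word bad good date; do
        diff=$(( good - bad ))
        if [ "$diff" -lt 0 ]; then
            diff=$(( -diff ))          # absolute value
        fi
        printf '%s %s %s %s %s\n' "$word" "$bad" "$good" "$date" "$diff"
    done
}

# sort works on stdin: key = 5th field only, numeric, reversed (largest first).
sample_input | append_abs_diff | sort -k5,5nr | head -n 10
```

On millions of lines a shell loop like this is likely to be far slower than awk, which is one reason the pipeline approach is usually preferred.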

1 Answer


If I understand your question correctly (here is why it is so important to include sample output from your sample input),

 cat tst.file
 hello 428 3654 20151116
 goodby -428 3655 20151116

This is assuming that the input is NOT tab-separated data. Also, if you care to update your question with a slightly larger data set I'll be happy to try to confirm this is a good solution. You might also want to include the required output from your input ;-) (hint, hint).

 awk '
    # abs() returns the magnitude of num
    function abs(num) {return (num < 0) ? -num : num}
    # append |$2 - $3| to each line as a tab-separated field
    {res=abs($2 - $3) ; print $0 "\t" res}' tst.file \
 | sort -t"${tabChar}" -k2n

produces output like

hello 428 3654 20151116    3226
goodby -428 3655 20151116  4083

Some sort programs support -t"\t" to define a tab as the sort delimiter. Mine doesn't, so I define it separately like tabChar=" " where that is a real tab char inside the double quotes. In Bash, tabChar=$'\t' also works.
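
A small sketch of the delimiter point above, on two made-up lines in "original line, tab, derived value" form. Both tabChar built via printf and the helper name sort_by_derived are assumptions for illustration; printf '\t' is used because it avoids typing a literal tab.

```shell
# Build a tab character without typing one literally.
tabChar=$(printf '\t')

# With -t set to a tab, field 2 is everything after the tab (the derived value),
# not the second space-separated column of the original line.
sort_by_derived() {
    sort -t"$tabChar" -k2,2n
}

# Made-up data: the derived values are 30 and 4.
printf 'b 9 1 x\t30\na 1 9 y\t4\n' | sort_by_derived
```

If the delimiter is left at its default or set to a space, field 2 becomes the first number in the line, so the lines come out ordered by that column instead.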


As I mentioned in the comments, you can simplify the above (assuming standard line endings from your program) like this:

bogoutil -d wordlist.db \
| awk '....' \
| sort -t"${tabChar}" -k2n
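
A minimal end-to-end sketch run against the sample input from the question's EDIT, assuming the goal is the absolute difference |$2 - $3| sorted largest first. Here sample_input is a hypothetical stand-in for bogoutil -d wordlist.db and append_abs_diff_awk is a made-up helper name. The result is appended as a fifth space-separated field, so sort needs no custom delimiter in this variant.

```shell
# Hypothetical stand-in for `bogoutil -d wordlist.db` (sample data from the question).
sample_input() {
    printf '%s\n' \
        'hello 4 1 20151116' \
        'goodbye 0 256 20151116' \
        'grant 428 3654 20151116'
}

# Append |$2 - $3| to every line; print with a comma joins using a single space.
append_abs_diff_awk() {
    awk '
        function abs(n) { return (n < 0) ? -n : n }
        { print $0, abs($2 - $3) }'
}

# Key = 5th field only, numeric, reversed (largest difference first).
sample_input | append_abs_diff_awk | sort -k5,5nr
```

Restricting the key with -k5,5 keeps sort from comparing anything past the fifth field, which matters if more columns are ever appended.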

IHTH

shellter
  • This seems like it's almost working, but I think it's sorting alphabetically rather than by the difference (as near as I can tell from what's scrolling past). I'm unfamiliar with awk, so I replaced "${tabChar}" with " " (as it was throwing an error with regards to multiple characters and -k2n). Would that possibly cause it, and should I predefine tabChar beforehand? EDIT: it listed 1 as the last char for a while and now it's listing -1 regularly, so I think it is sorting alphabetically. – g.grinovski Nov 18 '15 at 00:19
  • EDIT 2: it terminated with the last few lines being out of order, so I think that's the issue. – g.grinovski Nov 18 '15 at 00:23
  • Actually I think it's sorting by the first number, so yeah, that's definitely the problem. I've replaced that part with $'\t' and that seems like it's working. Thank you so much for your help, it's very much appreciated. – g.grinovski Nov 18 '15 at 00:38
  • Yep this worked out with the replacement. Thanks, it was very much appreciated! – g.grinovski Nov 18 '15 at 01:01