Sorting alphabetically using last column, using awk

Question

I am trying to sort text with a variable number of columns: sometimes there are 3 fields, sometimes 2.

Example input:

        George W. Bush
        Brack Obama
        Micky Mouse
        John F. Kennedy

Desired result:

         George W. Bush
         John F. Kennedy
         Micky Mouse
         Brack Obama

I want to get them in alphabetical order by last name, so using the $3 or $2 field.

So far, I've flipped each line to put the last name in front. However, after sorting I can't seem to flip them back. I've tried arrays, but I get far more output than expected (repeated lines).

I'd like to do this using only an awk file.

I've thought about using another awk file to flip them back in (let's say) a script of awk files, but I am not able to create a file while in awk (using bash scripts). I've been reading A Practical Guide to Linux but the examples I've seen seem all the same. Thanks for reviewing my question.

Currently this is how I am getting it done

    {
        #print $3 " " $1 " " $2;
        if ($3 == "") {
            #print "me";
            print $2 " " $1;
            #list[$3] = $2 "  " $1
        } else {
            print $3 " " $1 " " $2;
            #list[$3] = $3 " " $2 " " $1;
            #for (result in list) { print list[result]; }
        }
    }


    gawk -f fileUsed alphRecoredToBeUsed | sort

This leaves me with rearranged values that get sorted the way I want. However, I want to present each line in its original form (first name first) while keeping the alphabetical ordering.

Could you please post expected output in code tags too? – RavinderSingh13 Sep 26 '17 at 00:33

Ed Morton · Accepted Answer · 2017-09-26T05:24:12.643

4

With GNU awk for sorted_in:

$ awk '
    { a[$NF]=($NF in a ? a[$NF] ORS : "") $0 }
    END { PROCINFO["sorted_in"]="@ind_str_asc"; for (i in a) print a[i] }
' file
George W. Bush
John F. Kennedy
Micky Mouse
Brack Obama

or with any awk + sort + cut:

$ awk '{print $NF "\t" $0}' file | sort | cut -f2-
George W. Bush
John F. Kennedy
Micky Mouse
Brack Obama

edited Sep 26 '17 at 05:24

answered Sep 26 '17 at 05:18

Ed Morton


Depending on the order of input, this could present "Robert F. Kennedy" and "John F. Kennedy" in the wrong order. – Marc Lambrichs Sep 26 '17 at 07:14
The OP wants the output sorted by last name, which presumably means retaining input order when multiple people have the same last name, and that's what the awk script does. There's no reason to think any other order is better or more correct given duplicate last names. Having said that, the `awk | sort` version will sort by first names given duplicate last names. That also might be correct but if not the OP can add `-k1,1`. My main point is that there is no naturally "right" order and so no "wrong" order. The OP simply hasn't told us anything about what to do with duplicates so any order is correct. – Ed Morton Sep 26 '17 at 13:36
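To make the tie-breaking discussed in these comments concrete, here is a minimal sketch of the `awk | sort | cut` pipeline that compares only the last name and keeps input order for equal last names. It assumes a `sort` that supports the `-s` (stable) option, as GNU and BSD sort do, and input that contains no tab characters. The function name `sort_by_last_stable` is made up for illustration.

```shell
# Sketch: sort by last field only, keeping input order for equal last names.
# Assumption: sort(1) supports -s (stable), as GNU and BSD sort do.
# Assumption: input lines contain no tab characters.
sort_by_last_stable() {
    tab=$(printf '\t')
    awk '{ print $NF "\t" $0 }' |            # decorate: last field, tab, original line
        LC_ALL=C sort -s -t "$tab" -k1,1 |   # compare the key only, stable on ties
        cut -f2-                             # undecorate: drop the key
}

printf '%s\n' 'Brack Obama' 'Robert F. Kennedy' 'George W. Bush' 'John F. Kennedy' |
    sort_by_last_stable
```

Without `-s`, sort falls back to comparing the whole line when keys are equal, which is why the plain pipeline ends up ordering duplicates by first name.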

score 2 · Answer 2 · answered Sep 26 '17 at 02:48

Here is a script that uses gawk to sort based on the last word on each line:

#!/bin/sh
gawk '
function compare(i1, v1, i2, v2) {
    ct1 = split(v1, pcs1)
    ct2 = split(v2, pcs2)
    f1 = ct1 < 1 ? "" : pcs1[ct1]
    f2 = ct2 < 1 ? "" : pcs2[ct2]
    if (f1 < f2) return -1;
    if (f1 > f2) return 1;
    return 0
}
{ lines[++ct] = $0 }
END {
    asort(lines, sorted_lines, "compare");
    for (i = 1; i <= length(sorted_lines); i++)
        print sorted_lines[i]
}
' "$@"

It works for your example:

$ cat input
George W. Bush
Brack Obama
Micky Mouse
John F. Kennedy
$ ./s input
George W. Bush
John F. Kennedy
Micky Mouse
Brack Obama

(I'm using gawk 4.0.1, which supports a user-supplied comparison function.)
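If gawk is not available, the same idea can be sketched in plain awk by sorting in the END block by hand. This is only an illustration, not part of the script above: it uses a simple insertion sort (fine for small inputs, quadratic for large ones) keyed on the last field, and the function name `sort_by_last_portable` is hypothetical.

```shell
# Sketch: sort lines by last field using only POSIX awk features
# (no asort, no PROCINFO). Insertion sort keeps input order on ties.
sort_by_last_portable() {
    awk '
    { line[NR] = $0; key[NR] = $NF }          # remember each line and its last field
    END {
        for (i = 2; i <= NR; i++) {
            k = key[i]; l = line[i]
            # Shift entries with a larger key one slot to the right.
            # Appending "" forces a string comparison even for numeric-looking keys.
            for (j = i - 1; j >= 1 && (key[j] "") > (k ""); j--) {
                key[j + 1] = key[j]
                line[j + 1] = line[j]
            }
            key[j + 1] = k
            line[j + 1] = l
        }
        for (i = 1; i <= NR; i++) print line[i]
    }'
}

printf '%s\n' 'George W. Bush' 'Brack Obama' 'Micky Mouse' 'John F. Kennedy' |
    sort_by_last_portable
```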

abhishek phukan · Answer 3 · 2017-09-26T08:02:54.487

2

This might be easier:

sh-4.4$ awk '{print $NF,$0}' file |sort -k1|awk '{$1="";print $0}'
 George W. Bush
 John F. Kennedy
 Micky Mouse
 Brack Obama

What is being done: bring the last name to the front, sort, and then remove it from the output.

hope this helps

edited Sep 26 '17 at 08:02

answered Sep 26 '17 at 07:30

abhishek phukan


Wow, that's kind of heavy on the number of subshells spawned (one per utility and pipe). Also, if you ever find yourself doing `cat file ...` and you are not actually concatenating files, that is probably an *Unnecessary Use Of `cat`* (referred to as an UUOc). At minimum you can shorten your answer to `awk '{print $NF,$0}' filename | ...` and eliminate the UUOc `:)` – David C. Rankin Sep 26 '17 at 07:59
hey @DavidC.Rankin Thanks, i edited the answer. :) Hope this is what the person requesting wants – abhishek phukan Sep 26 '17 at 08:03
Ok, that's an A for effort. Good job on the UUOc removal, and the sort order is fine. – David C. Rankin Sep 26 '17 at 08:05
Also, sed can be used in case the leading whitespace is an issue: sed 's/^ *//g' – abhishek phukan Sep 26 '17 at 08:07
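As an alternative to cleaning up afterwards with sed, here is a sketch that avoids producing the leading space in the first place. Assigning `$1=""` makes awk rebuild the record, which leaves the separator behind; cutting the key off with `cut` on a single-space delimiter leaves the rest of the line untouched. The function name `sort_by_last_cut` is made up for illustration.

```shell
# Sketch: decorate with the last field, sort on it, then drop the key with cut
# so the original line comes back unchanged (no leading space).
sort_by_last_cut() {
    awk '{ print $NF, $0 }' |      # key and original line, separated by one space
        LC_ALL=C sort -k1,1 |      # sort on the key field
        cut -d ' ' -f2-            # remove the key and its single separator
}

printf '%s\n' 'George W. Bush' 'Brack Obama' 'Micky Mouse' 'John F. Kennedy' |
    sort_by_last_cut
```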

score 0 · Answer 4 · answered Sep 26 '17 at 00:35

One of my favorite awk variables is NF which is the Number of Fields in a record; meaning, the number of $1 $2... $NF where $NF is your last element. You can even do print $(NF-1) to make awk print your second to last element, or do any other math with that $(integer-after-math) notation if you ever find that need.

Instead of trying to swap everything around, just organize them based on $NF, which is the last name of each line in your data example.
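A quick way to see what NF, $NF and $(NF-1) hold for lines with different field counts (the function name `show_fields` is just for illustration):

```shell
# Sketch: print the field count, the last field and the second-to-last field.
show_fields() {
    awk '{ print NF ": last=" $NF ", second to last=" $(NF-1) }'
}

printf '%s\n' 'George W. Bush' 'Brack Obama' | show_fields
# 3: last=Bush, second to last=W.
# 2: last=Obama, second to last=Brack
```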

score 0 · Answer 5 · answered Sep 26 '17 at 02:49

Here's a one-line awk command to get the desired output:

$ awk '{a[$NF]=$0} END{PROCINFO["sorted_in"]="@ind_str_asc"; for(i in a)print a[i]}' file
        George W. Bush
        John F. Kennedy
        Micky Mouse
        Brack Obama

Brief explanation:

Use array a[$NF]=$0 to map each $NF to its $0.
PROCINFO["sorted_in"]="@ind_str_asc": order by indices in ascending order, compared as strings. Refer to the gawk manual for more details. Mind that it is specific to gawk.
for(i in a)print a[i]: because of the scanning order set in the previous step, the array is scanned in ascending order.

answered Sep 26 '17 at 02:49

CWLiu


If Bush sr. turns up with his son, George H.W., in the input file, just one of them gets elected, correction, printed. – Marc Lambrichs Sep 26 '17 at 04:45
Yes, you're right. This method can only order by the last field, so lines with the same last name overwrite one another. – CWLiu Sep 26 '17 at 05:31

Marc Lambrichs · Answer 6 · 2017-09-26T09:55:28.373

You need to order all fields to make this worthwhile.

one-liner:

$ awk '{s="";for (i=1;i<NF;i++)s=s $i;a[$NF s]=$0}END{n=asorti(a,b);for(j=1;j<=n;j++)print a[b[j]]}' input.txt

explanation:

{
  s=""                                 # initialize s
  for (i=1;i<NF;i++) s=s $i            # concatenate first and middle names
  a[$NF s]=$0                          # use last name followed by other names 
                                       # as index
}
END{
  n=asorti(a,b);                       # sort index of a
  for(j=1;j<=n;j++) print a[b[j]]      # print results
}

using this input:

$ cat input.txt
George W. Bush
George H.W. Bush
Michelle Obama
Barack Obama
Micky Mouse
John F. Kennedy

gives:

$ awk '{s="";for (i=1;i<NF;i++)s=s $i;a[$NF s]=$0}END{n=asorti(a,b);for(j=1;j<=n;j++)print a[b[j]]}' input.txt
George H.W. Bush
George W. Bush
John F. Kennedy
Micky Mouse
Barack Obama
Michelle Obama

And from GNU awk 4.1 you can use the join function:

@include "join"
{
  n=split($0, a, " ")
  s=join(a, 1, n-1)
  b[$NF s]=$0
}
END{
  n=asorti(b,c);
  for(j=1;j<=n;j++) print b[c[j]]
}

If `H.W.` is separated by a space like `H. W.` (I assume it would be) this solution gives me first `George W. Bush` and after that `George H. W. Bush`. That was one of the issues I tried and solved in my solution. — James Brown, Sep 26 '17 at 06:08
Then again: https://en.wikipedia.org/wiki/George_R._R._Martin . Also, the English wikipedia: https://en.wikipedia.org/wiki/George_H._W._Bush — James Brown, Sep 26 '17 at 06:22
`Rule 2.1.5 Add little or no space within strings of initials` from The elements of typographic style - R. Bringhurst. Nuff said. — Marc Lambrichs, Sep 26 '17 at 06:47

James Brown · Answer 7 · 2017-09-26T06:01:52.820

-1

In GNU awk:

$ awk '
{
    b=$NF                 # initialize the key buffer
    if(NF>1)              # if there are more than one word in the name
        for(i=1;i<NF;i++) # add them to the buffer
            b=b OFS $i
    a[b]=$0               # hash
}
END{
    PROCINFO["sorted_in"]="@ind_str_asc"  # order on the index using for
    for(i in a)
        print a[i]
}' file

outputs (added some usual suspects to the list for testing):

George H. W. Bush
George W. Bush
John F. Kennedy
John G. Kennedy
Madonna
Micky Mouse
Barack Obama
Brack Obama

As the hash key, the script uses lastname firstname_if_exists 1st_middle_if_exists etc., i.e. a["Bush George H. W."]="George H. W. Bush".
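The same key layout (last name first, then the remaining words in their original order) also works without GNU awk if the sorting is handed to sort. The following is only a sketch of that variation, assuming the input contains no tab characters; the function name `sort_by_full_name` is made up for illustration.

```shell
# Sketch: build the key "lastname first middle ..." in awk, let sort(1) order
# the lines by that key, then strip the key again.
# Assumption: input lines contain no tab characters.
sort_by_full_name() {
    awk '{
        key = $NF                                   # start with the last name
        for (i = 1; i < NF; i++) key = key " " $i   # append the other words in order
        print key "\t" $0                           # key, tab, original line
    }' | LC_ALL=C sort | cut -f2-
}

printf '%s\n' 'George W. Bush' 'Michelle Obama' 'George H. W. Bush' 'Barack Obama' 'Madonna' |
    sort_by_full_name
```

Unlike the array version, identical names are not collapsed into one entry here, because nothing is stored under a unique index.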

edited Sep 26 '17 at 06:01

answered Sep 26 '17 at 05:16

James Brown


Nothing new compared to earlier given solutions in here. Pls. explain yourself. – Marc Lambrichs Sep 26 '17 at 05:25
Not `asort`ing, just using `for`. Using all the words in the name (in reverse order) for ordering, not just the last name (solving the problem mentioned in @CWLiu's answer... and apparently introducing a new problem). – James Brown Sep 26 '17 at 05:28
..which is copied from @CWLiu's answer. – Marc Lambrichs Sep 26 '17 at 05:29
If you are going to adapt code from other people's answers, then you *must* give them explicit credit for their efforts. We call this [attribution](https://stackoverflow.blog/2009/06/25/attribution-required/), and it's always required here. Under the CC BY-SA license we use for all content, you are allowed---even encouraged---to adapt and remix other people's solutions, but you must provide attribution to them. Please [edit] your answer to comply with these requirements, or it risks being deleted. – Cody Gray - on strike Sep 27 '17 at 09:26
@CodyGray And I would also. My process here is not to read other people's solutions but to build my own from scratch. There are really only 2 ways of sorting provided by GNU awk (`asort` vs `for`) and I just happened to use the other one. Had I copied parts of my solution from others I would've referred to their work. What I brought new to solving this problem was ordering the parts of the name to the hash key, in my opinion, the correct way. How ironic, that got copied to another solution... My mistake was to release work-in-progress here. – James Brown Sep 27 '17 at 09:55
