I have a tab-separated file that I need to sort by the length of the first field. I found examples of a one-liner that should do this, but it gives very strange results:

awk -F\t '{print length($1) " " $0|"sort -rn"}' SpanishGlossary.utf8 | sed 's/^.[^>]*>/>/' > test.tmp

... gives this (several representative samples -- it's a very long file):

56 cafés especiales y orgánicos special and organic coffees
56 amplia experiencia gerencial broad managerial experience
55 una fundada confianza en que a well-founded confidence that
55 Servicios de Desarrollo Empresarial  Business Development Services
...
6 son estas are these
6 son entregadas a  are given to
6 son determinantes para    are crucial for
6 son autolimitativos   are self-limiting
...
0 tal grado de  such a degree of
0 tales such
0 tales propósitos  such purposes
0 tales principios  such principles
0 tales o cuales    this or that

That leading number should be the length of the first field, but it's obviously not. I don't know what that's counting.

What am I doing wrong? Thanks.

user1889034
  • Your count problem is with your field separator. You need to quote the argument `-F'\t'`. As written awk is using an FS of `t`. – Etan Reisner Mar 26 '14 at 15:30
  • That explains a lot. Thanks! – user1889034 Mar 26 '14 at 17:26
  • @EtanReisner: excellent point - now the numbers make sense (length up to the first `t`); on a somewhat related note: beware of `awk`'s `length()` function on OSX (as of 10.9.2) with UTF-8 locales: it counts the bytes of multi-byte chars. individually; e.g., `awk '{ print length($0) }' <<<'é'` returns `2`(!). – mklement0 Mar 26 '14 at 17:33
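
To see the effect described in the comments above, here is a minimal sketch using a made-up two-line sample (the file and its contents are not from the question). It sets FS to the letter `t` explicitly, because what an unquoted `-F\t` ends up meaning may vary between awk implementations, and compares that with a properly quoted tab separator.

```shell
# Hypothetical sample: two columns separated by a real tab character.
sample=$(mktemp)
printf 'son estas\tare these\ntales\tsuch\n' > "$sample"

# FS is the letter "t": $1 ends at the first "t" on the line.
letter_t=$(awk 'BEGIN { FS = "t" } { print length($1) }' "$sample")

# FS is a tab (note the quotes around \t): $1 is the whole first column.
tab=$(awk -F'\t' '{ print length($1) }' "$sample")

printf 'FS=t:\n%s\nFS=tab:\n%s\n' "$letter_t" "$tab"
rm -f "$sample"
```

With the letter `t` as the separator the lengths are 6 and 0, the same pattern as in the question's output; with the tab they are 9 and 5.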

1 Answer

Try this:

awk '$0=length($1) FS $0' file | sort -nr | sed -r 's/^\S*\s//'

Test:

kent$  cat f
as foo
a foo
aaa foo
aaaaa foo
aaaa foo

kent$  awk '$0=length($1) FS $0' f|sort -nr|sed -r 's/^\S*\s//'
aaaaa foo
aaaa foo
aaa foo
as foo
a foo

Here I used whitespace (the default) as awk's FS. If your fields are tab-separated, add `-F'\t'`.
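
For a tab-separated file like the one in the question, the same decorate, sort, strip idea might look like the sketch below. The sample data is made up, and `cut -f2-` (whose default delimiter is a tab) stands in for the sed step.

```shell
# Hypothetical tab-separated sample, including a first field longer than 9 characters.
sample=$(mktemp)
printf 'as\tfoo\na\tfoo\naaaaaaaaaaaa\tfoo\naaa\tfoo\n' > "$sample"

# 1. Prefix each line with the length of its first tab-separated field.
# 2. Sort numerically, longest first.
# 3. Drop the helper column again.
sorted=$(awk -F'\t' '{ print length($1) "\t" $0 }' "$sample" | sort -nr | cut -f2-)

printf '%s\n' "$sorted"
rm -f "$sample"
```

Lines whose first fields have the same length come out in whatever order `sort` picks when the numeric keys tie.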

EDIT

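For @Jaypal, here is an awk-only one-liner (GNU awk).

I say gawk because it has `asort` and `asorti`, which we can use for sorting.

I also changed the input file to add some lines whose $1 has the same length.

By default `asorti` compares the indices as strings, so a length of 13 would sort before 2. It is better to pass "@val_num_asc" (or the desc variant) as the third argument, `asorti(a,b,"...")`, so that the sort is numeric.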

kent$  cat f
as foo
a foo
aaa foo
ccc foo
aaaaa foo
bbbbb foo
aaaa foo

kent$  awk '{a[length($1)"."NR]=$0}END{asorti(a,b);for(i=NR;i>0;i--)print a[b[i]]}' f
bbbbb foo
aaaaa foo
aaaa foo
ccc foo
aaa foo
as foo
a foo
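
If gawk is not available, one alternative that avoids sorting altogether is to collect the lines into buckets keyed by length and print the buckets from longest to shortest. This is a sketch of a different technique, not the one-liner above. The sample data and the names `bucket` and `max` are made up, and the whole file is held in memory.

```shell
# Hypothetical sample with a 12-character first field and two fields of equal length.
sample=$(mktemp)
printf 'as foo\na foo\naaa foo\nccc foo\naaaaaaaaaaaa foo\naaaa foo\n' > "$sample"

by_length=$(awk '
{
    n = length($1)
    bucket[n] = bucket[n] $0 "\n"   # append the line to the bucket for its length
    if (n > max) max = n            # remember the longest first field seen
}
END {
    # Walk the lengths as numbers, so 12 comes before 4.
    for (i = max; i >= 0; i--)
        printf "%s", bucket[i]
}' "$sample")

printf '%s\n' "$by_length"
rm -f "$sample"
```

Unlike the gawk version, lines of equal length keep their input order here (`aaa foo` before `ccc foo`). For tab-separated input, add `-F'\t'` as before.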
Kent
  • +1 In place of sed, I'd write `cut -d ' ' -f 2-` to remove the 1st field. – glenn jackman Mar 26 '14 at 14:55
  • @glennjackman Is there a way we can do numerical sort in `awk`. I wrote a solution to solve the question stated above entirely in `awk` and ran into the issue where length of string is 13 and it gets printed last since 13<2. I wasn't sure about posting the solution since it is broken (wiki maybe?) – jaypal singh Mar 26 '14 at 15:52
  • Sounds like a perfect StackOverflow question! – glenn jackman Mar 26 '14 at 16:00
  • @jaypal didn't notice your new comment, I just added an awk (gawk) only solution in answer. – Kent Mar 26 '14 at 16:13
  • On the phone but does it work if one string is over 10 characters? – jaypal singh Mar 26 '14 at 16:16
  • @jaypal good point, I added the `PROCINFO["sorted_in"]` to let `asorti` sort on numbers – Kent Mar 26 '14 at 16:18
  • +1 for elegance, but note that there's nothing wrong with the OP's approach, except for the field-separator issue pointed out in a comment by @EtanReisner. – mklement0 Mar 26 '14 at 17:19