I am trying to numerically sort a list of files, output by the ls command, whose names match either the pattern ABCDE1234A1789.RST.txt or ABCDE12345A1789.RST.txt, using the '789' field as the sort key.

In the example patterns above, ABCDE is the same for all files; 1234 and 12345 are digits that vary but are always either 4 or 5 digits long. A1 is the same length for all files, but its value can vary, so unfortunately it can't be used as a delimiter. Everything after the first . is the same for all files. Something like:

ls -l *.RST.txt | sort -k +9.13 | awk '{print $9} ' > file-list.txt

sorts the shorter filenames correctly but not the longer ones, because the number of characters before the field I want to sort by varies.

Is there a way to accomplish sorting all files without first padding the shorter-length files to make them all the same length?

  • FYI -- the `sort` command is not part of bash, but a standard UNIX utility. As such, it's available to any program (and any shell). – Charles Duffy Sep 04 '13 at 21:32

3 Answers

Perl to the rescue!

perl -e 'print "$_\n" for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'

If your perl is more recent (5.10 or newer), you can shorten it to

perl -E 'say for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
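
Both one-liners key on substr($name, -11, 3): the fixed suffix .RST.txt is 8 characters and the wanted field is the 3 characters before it, hence the offset of 11 from the end. As a rough sketch of the same extraction in plain shell (the helper name sort_key and the sample filenames are made up for illustration):

```shell
# Hypothetical helper: print the 3-character field that sits just
# before the fixed ".RST.txt" suffix, however long the name is.
sort_key() {
  stem=${1%.RST.txt}     # drop the fixed 8-character suffix
  prefix=${stem%???}     # everything except the last 3 characters
  printf '%s\n' "${stem#"$prefix"}"
}

sort_key ABCDE1234A1042.RST.txt    # prints 042
sort_key ABCDE12345A2007.RST.txt   # prints 007
```

Because the offset is counted from the end of the name, the 4- versus 5-digit difference earlier in the name does not matter.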
choroba

Because of the parts of the filename which you've identified as unchanging, you can actually build a key which sort will use:

$ echo ABCDE{99999,8765,9876,345,654,23,21,2,3}A1789.RST.txt \
  | fmt -w1 \
  | sort -tE -k2,2n --debug
ABCDE2A1789.RST.txt
     _
___________________
ABCDE3A1789.RST.txt
     _
___________________
ABCDE21A1789.RST.txt
     __
etc.

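This tells sort to split fields on the character E and then compare the second field numerically; a numeric comparison reads only the digits at the start of that field. The --debug option arrived in coreutils 8.6 and can be very helpful for seeing exactly which part of each line sort is using as the key.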
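
As a quick check of what that key selects, here is a minimal sketch without --debug, using made-up names that follow the question's pattern. The function name sort_on_digits_after_E is hypothetical. Note that the key is the run of digits immediately after E (the 4- or 5-digit group), so this orders on that group rather than on the trailing three digits.

```shell
# Hypothetical wrapper: split each line on "E" and sort numerically
# on the second field, whose leading digits are the number after ABCDE.
sort_on_digits_after_E() {
  sort -tE -k2,2n
}

printf '%s\n' \
  ABCDE12345A1789.RST.txt \
  ABCDE345A1789.RST.txt \
  ABCDE8765A1789.RST.txt |
  sort_on_digits_after_E
# Resulting order: the 345 name, then the 8765 name, then the 12345 name.
```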

PhilR

The conventional way to do this in bash is to extract your sort field, sort on it, and then strip it off again. Apart from the call to sort, the following is pure bash:

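sort_names_by_first_num() {
  shopt -s extglob  # needed for the +(...) pattern below
  for f; do
    first_num=${f##+([^0-9])}        # strip the leading non-digits
    first_num=${first_num%%[^0-9]*}  # strip from the first non-digit onward
    # decorate: key, tab, name (names containing no digits are skipped)
    [[ $first_num ]] && printf '%s\t%s\n' "$first_num" "$f"
  done | sort -n | while IFS='' read -r name; do
    name=${name#*$'\t'}              # undecorate: drop the key and the tab
    printf '%s\n' "$name"
  done
}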

sort_names_by_first_num *.RST.txt
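
The same decorate, sort, undecorate idea can be pointed at the trailing three-digit field the question asks about. The following is only a sketch under stated assumptions: every name ends in the 8-character suffix .RST.txt, names contain no tabs or newlines, and the function name sort_by_trailing_field and the sample names are invented for illustration. It uses awk and cut rather than pure bash.

```shell
# Hypothetical helper: read one name per line and sort numerically on
# the 3 characters just before the 8-character ".RST.txt" suffix.
sort_by_trailing_field() {
  # decorate: prefix each name with its key and a tab
  awk '{ print substr($0, length($0) - 10, 3) "\t" $0 }' |
    sort -n |   # numeric sort reads the key at the start of each line
    cut -f 2-   # undecorate: drop the key, keep the name
}

printf '%s\n' \
  ABCDE1234A1300.RST.txt \
  ABCDE12345A1042.RST.txt \
  ABCDE9999A2150.RST.txt |
  sort_by_trailing_field
# Resulting order: the ...042 name, then ...150, then ...300.
```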

That said, newline-delimiting filenames (as this question seems to call for) is bad practice: filenames on UNIX filesystems are allowed to contain newlines, so a newline-separated list cannot represent every valid name. It is much better practice to NUL-delimit your lists. Doing that would look like so:

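sort_names_by_first_num() {
  shopt -s extglob  # needed for the +(...) pattern below
  for f; do
    first_num=${f##+([^0-9])}        # strip the leading non-digits
    first_num=${first_num%%[^0-9]*}  # strip from the first non-digit onward
    # decorate: key, tab, name, NUL (names containing no digits are skipped)
    [[ $first_num ]] && printf '%s\t%s\0' "$first_num" "$f"
  done | sort -n -z | while IFS='' read -r -d '' name; do
    name=${name#*$'\t'}              # undecorate: drop the key and the tab
    printf '%s\0' "$name"
  done
}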

sort_names_by_first_num *.RST.txt
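
To see why the delimiter matters, here is a small sketch that counts records under each convention for a single, deliberately awkward name. The helper names and the filename are made up for illustration.

```shell
# Hypothetical helpers: count the records arriving on stdin.
count_nl_records()  { wc -l | tr -d ' '; }                   # newline-terminated
count_nul_records() { tr -cd '\000' | wc -c | tr -d ' '; }   # NUL-terminated

# A made-up but legal filename with a newline embedded in it.
name='ABCDE1234A1789
.RST.txt'

printf '%s\n' "$name" | count_nl_records    # prints 2: one name, two records
printf '%s\0' "$name" | count_nul_records   # prints 1
```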
Charles Duffy
  • Thanks, Charles, for a very comprehensive alternative. In this use-case, perl is available to the users and it's perhaps marginally easier to implement than the bash function, but I very much appreciate the option! – Michael Meech Sep 04 '13 at 21:47