5

How I can sort array data alphanumerically in ruby?

Suppose my array is a = [test_0_1, test_0_2, test_0_3, test_0_4, test_0_5, test_0_6, test_0_7, test_0_8, test_0_9, test_1_0, test_1_1, test_1_2, test_1_3, test_1_4, test_1_5, test_1_6, test_1_7, test_1_8, test_1_9, test_1_10, test_1_11, test_1_12, test_1_13, test_1_14, ...........test_1_121...............]

I want my output to be:

.
.
.
test_1_121
.
.
.
test_1_14
test_1_13
test_1_12
test_1_11
test_1_10
test_1_9
test_1_8
test_1_7
test_1_6
test_1_5
test_1_4
test_1_3
test_1_2
test_1_1
test_0_10
test_0_9
test_0_8
test_0_7
test_0_6
test_0_5
test_0_4
test_0_3
test_0_2
test_0_1
mu is too short
  • 426,620
  • 70
  • 833
  • 800
user682932
  • 53
  • 1
  • 3
  • 1
    Because the sort required for this isn't a straightforward comparison of two values, you'll want to use `sort_by`. Overhead induced by manipulating strings, or diving into objects, can kill `sort`. – the Tin Man Mar 30 '11 at 05:03

9 Answers9

8

A generic algorithm for sorting strings that contain non-padded sequence numbers at arbitrary positions.

padding = 4
list.sort{|a,b|
  a,b = [a,b].map{|s| s.gsub(/\d+/){|m| "0"*(padding - m.size) + m } }
  a<=>b
}

where padding is the field length you want the numbers to have during comparison. Any number found in a string will be zero padded before comparison if it consists of less than "padding" number of digits, which yields the expected sorting order.

To yield the result asked for by the user682932, simply add .reverse after the sort block, which will flip the natural ordering (ascending) into a descending order.

With a pre-loop over the strings you can of course dynamically find the maximum number of digits in the list of strings, which you can use instead of hard-coding some arbitrary padding length, but that would require more processing (slower) and a bit more code. E.g.

padding = list.reduce(0){|max,s| 
  x = s.scan(/\d+/).map{|m|m.size}.max
  (x||0) > max ? x : max
}
Jojje
  • 81
  • 1
  • 4
  • I like the padding idea, but it will not work in two situations: 1. sequences of digits longer than the padding. 2. long strings like "1.1.1.1..." with a big padding will get their size multiplied so much that it will not fit in memory. – Cœur Apr 22 '19 at 14:20
5

If you simply sort as string, you will not get the correct ordering between 'test_2' and 'test_10', for example. So do:

sort_by{|s| s.scan(/\d+/).map{|s| s.to_i}}.reverse
sawa
  • 165,429
  • 45
  • 277
  • 381
2

You can pass a block to the sort function to custom sort it. In your case you will have a problem because your numbers aren't zero padded, so this method zero pads the numerical parts, then sorts them, resulting in your desired sort order.

a.sort { |a,b|
  ap = a.split('_')
  a = ap[0] + "%05d" % ap[1] + "%05d" % ap[2]
  bp = b.split('_')
  b = bp[0] + "%05d" % bp[1] + "%05d" % bp[2]
  b <=> a
}
Chris Cherry
  • 28,118
  • 6
  • 68
  • 71
2

Sort routines can have greatly varying processing times. Benchmarking variations of the sort can quickly home in on the fastest way to do things:

#!/usr/bin/env ruby

ary = %w[
    test_0_1  test_0_2   test_0_3 test_0_4 test_0_5  test_0_6  test_0_7
    test_0_8  test_0_9   test_1_0 test_1_1 test_1_2  test_1_3  test_1_4  test_1_5
    test_1_6  test_1_7   test_1_8 test_1_9 test_1_10 test_1_11 test_1_12 test_1_13
    test_1_14 test_1_121
]

require 'ap'
ap ary.sort_by { |v| a,b,c = v.split(/_+/); [a, b.to_i, c.to_i] }.reverse

And its output:

>> [
>>     [ 0] "test_1_121",
>>     [ 1] "test_1_14",
>>     [ 2] "test_1_13",
>>     [ 3] "test_1_12",
>>     [ 4] "test_1_11",
>>     [ 5] "test_1_10",
>>     [ 6] "test_1_9",
>>     [ 7] "test_1_8",
>>     [ 8] "test_1_7",
>>     [ 9] "test_1_6",
>>     [10] "test_1_5",
>>     [11] "test_1_4",
>>     [12] "test_1_3",
>>     [13] "test_1_2",
>>     [14] "test_1_1",
>>     [15] "test_1_0",
>>     [16] "test_0_9",
>>     [17] "test_0_8",
>>     [18] "test_0_7",
>>     [19] "test_0_6",
>>     [20] "test_0_5",
>>     [21] "test_0_4",
>>     [22] "test_0_3",
>>     [23] "test_0_2",
>>     [24] "test_0_1"
>> ]

Testing the algorithms for speed shows:

require 'benchmark'

n = 50_000
Benchmark.bm(8) do |x|
  x.report('sort1') { n.times { ary.sort { |a,b| b <=> a }         } }
  x.report('sort2') { n.times { ary.sort { |a,b| a <=> b }.reverse } }
  x.report('sort3') { n.times { ary.sort { |a,b|
                                  ap = a.split('_')
                                  a = ap[0] + "%05d" % ap[1] + "%05d" % ap[2]
                                  bp = b.split('_')
                                  b = bp[0] + "%05d" % bp[1] + "%05d" % bp[2]
                                  b <=> a
                                } } }

  x.report('sort_by1') { n.times { ary.sort_by { |s| s                                               }         } }
  x.report('sort_by2') { n.times { ary.sort_by { |s| s                                               }.reverse } }
  x.report('sort_by3') { n.times { ary.sort_by { |s| s.scan(/\d+/).map{ |s| s.to_i }                 }.reverse } }
  x.report('sort_by4') { n.times { ary.sort_by { |v| a = v.split(/_+/); [a[0], a[1].to_i, a[2].to_i] }.reverse } }
  x.report('sort_by5') { n.times { ary.sort_by { |v| a,b,c = v.split(/_+/); [a, b.to_i, c.to_i]      }.reverse } }
end


>>               user     system      total        real
>> sort1     0.900000   0.010000   0.910000 (  0.919115)
>> sort2     0.880000   0.000000   0.880000 (  0.893920)
>> sort3    43.840000   0.070000  43.910000 ( 45.970928)
>> sort_by1  0.870000   0.010000   0.880000 (  1.077598)
>> sort_by2  0.820000   0.000000   0.820000 (  0.858309)
>> sort_by3  7.060000   0.020000   7.080000 (  7.623183)
>> sort_by4  6.800000   0.000000   6.800000 (  6.827472)
>> sort_by5  6.730000   0.000000   6.730000 (  6.762403)
>> 

Sort1 and sort2 and sort_by1 and sort_by2 help establish baselines for sort, sort_by and both of those with reverse.

Sorts sort3 and sort_by3 are two other answers on this page. Sort_by4 and sort_by5 are two spins on how I'd do it, with sort_by5 being the fastest I came up with after a few minutes of tinkering.

This shows how minor differences in the algorithm can make a difference in the final output. If there were more iterations, or larger arrays being sorted the differences would be more extreme.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
1

Similar to @ctcherry answer, but faster:

a.sort_by {|s| "%s%05i%05i" % s.split('_') }.reverse

EDIT: My tests:

require 'benchmark'
ary = []
100_000.times { ary << "test_#{rand(1000)}_#{rand(1000)}" }
ary.uniq!; puts "Size: #{ary.size}"

Benchmark.bm(5) do |x|
  x.report("sort1") do
    ary.sort_by {|e| "%s%05i%05i" % e.split('_') }.reverse
  end
  x.report("sort2") do
    ary.sort { |a,b|
      ap = a.split('_')
      a = ap[0] + "%05d" % ap[1] + "%05d" % ap[2]
      bp = b.split('_')
      b = bp[0] + "%05d" % bp[1] + "%05d" % bp[2]
      b <=> a
    } 
  end
  x.report("sort3") do
    ary.sort_by { |v| a, b, c = v.split(/_+/); [a, b.to_i, c.to_i] }.reverse
  end
end

Output:

Size: 95166

           user     system      total        real
sort1  3.401000   0.000000   3.401000 (  3.394194)
sort2 94.880000   0.624000  95.504000 ( 95.722475)
sort3  3.494000   0.000000   3.494000 (  3.501201)
Guilherme Bernal
  • 8,183
  • 25
  • 43
1

Posting here a more general way to perform a natural decimal sort in Ruby. The following is inspired by my code for sorting "like Xcode" from https://github.com/CocoaPods/Xcodeproj/blob/ca7b41deb38f43c14d066f62a55edcd53876cd07/lib/xcodeproj/project/object/helpers/sort_helper.rb, itself loosely inspired by https://rosettacode.org/wiki/Natural_sorting#Ruby.

Even if it's clear that we want "10" to be after "2" for a natural decimal sort, there are other aspects to consider with multiple possible alternative behaviors wanted:

  • How do we treat equality like "001"/"01": do we keep the original array order or do we have a fallback logic? (Below, choice is made to have a second pass with a strict ordering logic in case of equality during first pass)
  • Do we ignore consecutive spaces for sorting, or does each space character count? (Below, choice is made to ignore consecutive spaces on first pass, and have a strict comparison on the equality pass)
  • Same question for other special characters. (Below, choice is made to make any non-space and non-digit character count individually)
  • Do we ignore case or not; is "a" before or after "A"? (Below, choice is made to ignore case on first pass, and we have "a" before "A" on the equality pass)

With those considerations:

  • It means that we should almost certainly use scan instead of split, because we're going to have potentially three kinds of substrings to compare (digits, spaces, all-the-rest).
  • It means that we should almost certainly work with a Comparable class and with def <=>(other) because it's not possible to simply map each substring to something else that would have two distinct behaviors depending on context (the first pass and the equality pass).

This results in a bit lengthy implementation, but it works nicely for edge situations:

  # Wrapper for a string that performs a natural decimal sort (alphanumeric).
  # @example
  #   arrayOfFilenames.sort_by { |s| NaturalSortString.new(s) }
  class NaturalSortString
    include Comparable
    attr_reader :str_fallback, :ints_and_strings, :ints_and_strings_fallback, :str_pattern

    def initialize(str)
      # fallback pass: case is inverted
      @str_fallback = str.swapcase
      # first pass: digits are used as integers, spaces are compacted, case is ignored
      @ints_and_strings = str.scan(/\d+|\s+|[^\d\s]+/).map do |s|
        case s
        when /\d/ then Integer(s, 10)
        when /\s/ then ' '
        else s.downcase
        end
      end
      # second pass: digits are inverted, case is inverted
      @ints_and_strings_fallback = @str_fallback.scan(/\d+|\D+/).map do |s|
        case s
        when /\d/ then Integer(s.reverse, 10)
        else s
        end
      end
      # comparing patterns
      @str_pattern = @ints_and_strings.map { |el| el.is_a?(Integer) ? :i : :s }.join
    end

    def <=>(other)
      if str_pattern.start_with?(other.str_pattern) || other.str_pattern.start_with?(str_pattern)
        compare = ints_and_strings <=> other.ints_and_strings
        if compare != 0
          # we sort naturally (literal ints, spaces simplified, case ignored)
          compare
        else
          # natural equality, we use the fallback sort (int reversed, case swapped)
          ints_and_strings_fallback <=> other.ints_and_strings_fallback
        end
      else
        # type mismatch, we sort alphabetically (case swapped)
        str_fallback <=> other.str_fallback
      end
    end
  end

Usage

Example 1:

arrayOfFilenames.sort_by { |s| NaturalSortString.new(s) }

Example 2:

arrayOfFilenames.sort! do |x, y|
  NaturalSortString.new(x) <=> NaturalSortString.new(y)
end

You may find my test case at https://github.com/CocoaPods/Xcodeproj/blob/ca7b41deb38f43c14d066f62a55edcd53876cd07/spec/project/object/helpers/sort_helper_spec.rb, where I used this reference for ordering: [ ' a', ' a', '0.1.1', '0.1.01', '0.1.2', '0.1.10', '1', '01', '1a', '2', '2 a', '10', 'a', 'A', 'a ', 'a 2', 'a1', 'A1B001', 'A01B1', ]

Of course, feel free to customize your own sorting logic now.

Community
  • 1
  • 1
Cœur
  • 37,241
  • 25
  • 195
  • 267
  • Can this somehow be applied to`Dir.entries(src).sort.each do |item|` where filenames look like `1886–7 Los Angeles City Directory. Bynon. p221. Suzzalo.jpg` and I want to naturally sort on the digits after the `p` where they would be up to 9999 and may include other characters such as `pi` but anything but digits could be ignored for my purpose, but still need to include (ending up at end or beginning)? I can imagine grabbing all the file names in an array and then proceeding from there, but I'm poor at Ruby so a more direct approach would be preferred. And with all of the Ruby magic, maybe. – Greg Jul 11 '21 at 22:48
1

I checked the Wikipedia page for the Unix sort function, the GNU version of which has a -V flag which sorts "version strings" generically. (I take this to mean mixtures of digits and non-digits, where you want the numeric parts to be sorted numerically, and the non-numeric parts lexically).

The article states that:

The GNU implementation has a -V --version-sort option which is a natural sort of (version) numbers within text. Two text strings that are to be compared are split into blocks of letters and blocks of digits. Blocks of letters are compared alpha-numerically, and blocks of digits are compared numerically (i.e., skipping leading zeros, more digits means larger, otherwise the leftmost digits that differ determine the result). Blocks are compared left-to-right and the first non-equal block in that loop decides which text is larger. This happens to work for IP addresses, Debian package version strings and similar tasks where numbers of variable length are embedded in strings.

sawa's solution works somewhat like this but doesn't sort by non-numeric parts.

So it seems useful to post a solution somewhere between Coeur and sawa's, which works like GNU sort -V

a.sort_by do |r|
  # Split the field into array of [<string>, nil] or [nil, <number>] pairs
  r.to_s.scan(/(\D+)|(\d+)/).map do |m|
    s,n = m
    n ? n.to_i : s.to_s # Convert number strings to integers
  end.to_a
end

In my case, I wanted to sort a TSV file by its fields like this, so as a bonus, here's the script for this case too:

require 'csv'

# Sorts a tab-delimited file input on STDIN, sortin

opts = {
  headers:true,
  col_sep: "\t",
  liberal_parsing: true,
}

table = CSV.new($stdin, **opts)


# Emulate unix's sort -V: split each field into an array of string or
# numeric values, and sort by those in turn. So for example, A10
# sorts above A100.
sorted_ary = table.sort_by do |r|
  r.fields.map do |f|
    # Split the field into array of [<string>, nil] or [nil, <number>] values
    f.to_s.scan(/(\D+)|(\d+)/).map do |m|
      s,n = m
      n ? n.to_i : s.to_s # Convert number strings to integers
    end.to_a
  end
end

puts CSV::Table.new(sorted_ary).to_csv(**opts)

(Aside: another solution here sorts using Gem::Version, but that only seems to work with well formed Gem version strings.)

wu-lee
  • 749
  • 4
  • 17
  • Although beware, TSVs are sometimes subtly different to a CSV with `\t` delimiters: https://stackoverflow.com/questions/4404787/whats-the-best-way-to-parse-a-tab-delimited-file-in-ruby – wu-lee Jan 05 '22 at 20:06
0

From the looks of it, you want to use the sort function and/or the reverse function.

ruby-1.9.2-p136 :009 > a = ["abc_1", "abc_11", "abc_2", "abc_3", "abc_22"]
 => ["abc_1", "abc_11", "abc_2", "abc_3", "abc_22"] 

ruby-1.9.2-p136 :010 > a.sort
 => ["abc_1", "abc_11", "abc_2", "abc_22", "abc_3"] 
ruby-1.9.2-p136 :011 > a.sort.reverse
 => ["abc_3", "abc_22", "abc_2", "abc_11", "abc_1"] 
Mike Lewis
  • 63,433
  • 20
  • 141
  • 111
  • Try this with a = [test_0_1, test_0_2, test_0_3, test_0_4, test_0_5, test_0_6, test_0_7, test_0_8, test_0_9, test_1_0, test_1_1, test_1_2, test_1_3, test_1_4, test_1_5, test_1_6, test_1_7, test_1_8, test_1_9, test_1_10, test_1_11, test_1_12, test_1_13, test_1_14, ...........test_1_121...............] having two underscores. It doesn't work. – user682932 Mar 30 '11 at 01:18
0

Ok, from your output , it seems like you just want it to reverse, so use reverse()

a.reverse
kurumi
  • 25,121
  • 5
  • 44
  • 52