
Update

On the same computer, using the Rakudo compiler "rakudo-moar-2021.06-01-macos-x86_64-clang.tar", I get speed-ups of 3 to 10 times compared to the timings of my calculations in the original post.

.elems: 100000
.head(3): (id.20444 => 81.95246687507492 id.81745 => 34.859323339828464 id.79973 => 97.33816856420829)
time of .sort({-$_.value}) : 0.764283216
time of .sort(-*.value)    : 0.618963783
time of .sort.reverse      : 0.584477656
time of .values.sort       : 1.68912663

Note that those timings are close to the R timings. So, for these kinds of computations Raku has the performance I expect.


Original post

I recently watched a FOSDEM presentation by Elizabeth Mattijsen titled "Raku - Sets without Borders" and decided to adopt Raku Mix objects in some of my computational workflows.

I noticed that sorting (the pairs of) a Mix object is very slow -- I would say 100 to 1000 times slower than what I expect. See the Raku code and output below. (I also provided related R code and output on the same computer.)

Is that slowness expected? Is there a workaround for faster computations?

(To be more specific, I am interested in fast reverse-sort and fast retrieval of the top-K largest elements in a Mix.)
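To make that concrete, here is a small sketch (with a hypothetical, much smaller Mix) of the kind of top-K retrieval I mean:

```raku
# A small Mix analogous to the one benchmarked below.
my $m = Mix((1..10).map({ 'id.' ~ $_ => 100.rand }));

# Reverse-sort the pairs by value and keep the K largest.
my @top3 = $m.sort(-*.value).head(3);
say @top3;
```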

(The timings are on a few-years-old MacBook Pro, macOS 10.15.7, using the latest Rakudo compiler "rakudo-moar-2021.02.1-01-macos-x86_64-clang.tar.gz".)

Raku

#!/usr/bin/env perl6

my @words = Array(1 .. 100_000).map({ 'id.' ~ $_.Str });
my $m0 = Mix(@words.map({ $_ => 100.rand() }));
say '.elems: ', $m0.elems;
say '.head(3): ', $m0.head(3);

my $start = now;
my $m1 = $m0.sort({-$_.value});
say 'time of .sort({-$_.value}): ', now - $start;

$start = now;
my $m2 = $m0.sort(-*.value);
say 'time of .sort(-*.value)   : ', now - $start;

$start = now;
my $m3 = $m0.sort.reverse;
say 'time of .sort.reverse     : ', now - $start;

$start = now;
my $m4 = $m0.values.sort;
say 'time of .values.sort      : ', now - $start;

# .elems: 100000
# .head(3): (id.96239 => 87.89629474533156 id.89110 => 11.661698290245525 id.24795 => 64.80528155838671)
# time of .sort({-$_.value}): 3.64936396
# time of .sort(-*.value)   : 4.0388654
# time of .sort.reverse     : 4.5783556
# time of .values.sort      : 4.3461059

R

Here is a similar data and sorting code in R:

words <- paste0( 'id.', 1:100000)
m <- setNames( runif(n = length(words), min = 0, max = 100), words)
cat( "length(m)                          : ", length(m), "\n")
cat( "m[1:3]:\n"); print(m[1:3]); cat("\n")
cat( "system.time( sort(m) )             : ", system.time( sort(m) ), "\n")
cat( "system.time( m[order(-m)] )        : ", system.time( m[order(-m)] ), "\n")
cat( "system.time( rev(sort(names(m))) ) : ", system.time( rev(sort(names(m))) ), "\n")

# length(m)                          :  100000 
# m[1:3]:
#     id.1     id.2     id.3 
# 89.99714 54.31701 11.57415 
#
# system.time( sort(m) )             :  0.011 0     0.011 0 0 
# system.time( m[order(-m)] )        :  0.011 0     0.011 0 0 
# system.time( rev(sort(names(m))) ) :  0.298 0.001 0.3   0 0 

Here are answers to questions by @raiph:

  • "Is the m in the R code mutable?"
    No, R objects are mostly immutable.

  • "Does the sort(m) build a new data structure, or just a new index into the existing m structure?"
    A new data structure is created. R is a descendant of LISP, so it mostly follows, in spirit, the functional programming paradigm.

  • "Same question for m[order(-m)]?"
    order(-m) gives an integer vector (of indexes). That vector of indexes is used to retrieve elements of m into a new object.

  • "And rev(sort(names(m)))?"
    names(m) takes the "keys" of m's elements. Those keys are sorted and placed into a character vector, and then that character vector is reversed. (I.e. a new object is created.)

  • "Presumably building just an index could be dramatically faster. Perhaps Raku could have a sort variant for Tuples that produces a Seq that relies on that approach?"
    I assume it is not my place to comment on this, but I would like to mention that:

    • Julia -- which is also a LISP descendant -- does something like that for its data structures. And Julia's creators and developers claim it to be, in general, (much) faster than R and Mathematica, which are other LISP descendants used for mathematical computations.
    • I prefer and expect Raku code in the functional paradigm style to be fast.
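As a hypothetical illustration of the index-based idea @raiph describes (this is not how Rakudo currently implements sort), one can sort a vector of positions while leaving the data itself untouched:

```raku
my @values = 100.rand xx 10;

# Build an index: positions sorted by the values they point at.
# @values itself is never rearranged.
my @idx = (^@values).sort({ @values[$_] });

# A reverse-sorted view is then just the index read backwards.
say @values[@idx.reverse];
```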

Updated R benchmark

A few people requested more detailed R benchmarking:

library(microbenchmark)
set.seed(32)
words <- paste0( 'id.', 1:100000)
m <- setNames( runif(n = length(words), min = 0, max = 100), words)
cat( "length(m): ", length(m), "\n")
cat( "m[1:3]:\n"); print(m[1:3]); cat("\n")

microbenchmark::microbenchmark(
  sort(m, decreasing = T),
  sort(-m),
  m[order(-m)],
  rev(sort(m)),
  rev(sort(names(m))),
  unit = "s", times = 100
)

# length(m) :  100000
#
# m[1:3]:
#     id.1     id.2     id.3
# 50.58405 59.48084 80.87471
#
# Unit: seconds
#                     expr         min          lq        mean      median          uq        max neval cld
#  sort(m, decreasing = T) 0.006246853 0.007789205 0.009215613 0.008263348 0.009002414 0.02450786   100 a
#                 sort(-m) 0.006857755 0.008376058 0.010292145 0.008939605 0.010069702 0.02469324   100 a
#             m[order(-m)] 0.006658089 0.008257555 0.009726704 0.008718414 0.009811200 0.02294023   100 a
#             rev(sort(m)) 0.008975013 0.010939122 0.014965756 0.011692480 0.012571979 0.22022085   100  b
#      rev(sort(names(m))) 0.256036106 0.268526455 0.278385866 0.277794917 0.288586351 0.31160492   100   c
#
Anton Antonov
  • I'll note that this doesn't seem to be particularly tied to `Mix`es. Directly sorting similar `Array` (e.g., `my @a = (1 .. 100_000).map({100.rand}); @a.sort(-*)`) has similar performance characteristics. I'll be interested to see the answers to this question. – codesections Mar 05 '21 at 18:11
  • @codesections Good point. I experimented with `Hash` objects and got similar outcomes, but did not try it out with `Array` objects. (Although, that is sort of covered with the `m4` line, `$m0.values.sort;`, in my post.) – Anton Antonov Mar 05 '21 at 18:31
  • I'm not sure we'll be able to match R in performance any time soon, but looking at a profile, it looks like the `Mu.Bool` proto and its dispatch to `Int.Bool` is not getting inlined (when it clearly should), and is thus responsible for about 50% of the CPU used in the example code. Investigating... – Elizabeth Mattijsen Mar 05 '21 at 18:35
  • @ElizabethMattijsen What command line and viewer did you use for the profile? Just `raku --profile program.raku` and load the result in a web browser or something more elaborate? – raiph Mar 05 '21 at 18:56
  • @raiph just that :-) – Elizabeth Mattijsen Mar 05 '21 at 19:38
  • @ElizabethMattijsen Thank you for your comment. And thank you for your "Raku - Sets without Borders" presentation! I found it inspiring... – Anton Antonov Mar 05 '21 at 20:05
  • https://github.com/rakudo/rakudo/commit/70a8e3661f improves the situation. Timings before on my machine: .sort({-$_.value}): 2.8491583, .sort(-*.value): 2.78743497, .sort.reverse: 3.508573, .values.sort: 3.05211257; and after: .sort({-$_.value}): 1.6635785, .sort(-*.value): 1.568134, .sort.reverse: 2.061565934, .values.sort: 1.75952683. Thank you for bringing this to our attention! – Elizabeth Mattijsen Mar 06 '21 at 11:28
  • Per [R's `system.time`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/system.time), "should a garbage collection be performed immediately before the timing? Default is TRUE.". – raiph Mar 06 '21 at 12:14
  • Is the `m` in the R code mutable? Does the `sort(m)` build a new data structure, or just a new index into the existing `m` structure? Same question for `m[order(-m)]`? And `rev(sort(names(m)))`? Presumably building just an index could be dramatically faster. Perhaps Raku could have a `sort` variant for `Tuple`s that produces a `Seq` that relies on that approach? – raiph Mar 06 '21 at 12:14
  • Rakudo already optimizes sort for native arrays, and internally uses an in-place sort when it determines it can do so. Furthermore, I've identified some more optimisable comparators that further speed up the OP's code. – Elizabeth Mattijsen Mar 06 '21 at 12:45
  • @raiph Good R-code related questions -- I answered them in the post. (I have used R for a long time, so I am prone to *not* give sufficient background of its features...) – Anton Antonov Mar 06 '21 at 14:08
  • @ElizabethMattijsen Great to see the speedup commit! "internally uses an in-place sort when it determines it can do so." Presumably one of the inputs to that determination is escape analysis, right? Is spotting that a `sort` was triggered by code of the form `array.=sort` another input? If not, could it be? Do you see what I mean about the potentially dramatic advantage of building a `Seq` or some similar approach when sorting some immutable composite data structure? One day -- not this year, but one day? – raiph Mar 06 '21 at 15:04
  • @raiph One day, yeah :-) Meanwhile there's a good chance that the `@array .= sort` case will be optimizable to be in-place when RakuAST has landed. – Elizabeth Mattijsen Mar 06 '21 at 15:15
  • @ElizabethMattijsen "One day, yeah". Is that comment about escape analysis being used when determining it's OK to sort in place for a *mutable* structure, or building a `Seq` index when sorting an *immutable* structure? – raiph Mar 06 '21 at 15:18
  • @AntonAntonov Raku is clearly a heck of a lot slower. It might be helpful for better establishing the general takeaway from your benchmarks if you ran them, almost exactly as you have them, but repeating the expression 100 times (so, same list size -- 100,000, but repeat each sort 100 times) and, for the R code, using `system.time( expr-repeated-100-times, gcFirst = FALSE)` for the R timing, so allowing R's GC to go ahead, as it is with your Raku code, making the comparison more apples-to-apples. Rakudo *may* then be proportionally even slower, but it would be useful info one way or the other. – raiph Mar 06 '21 at 15:23
  • @raiph: you don't need escape analysis to be able to make `@a .= sort`: the AST gives that information already. I have been thinking on using a binary chop on uncached `Seq`s when sorting a `Seq`, so it would build the result while consuming the `Seq`. – Elizabeth Mattijsen Mar 06 '21 at 17:46
  • At the very least this is an imprecise comparison between Raku and R (and Mathematica, and Julia, etc). Within my albeit older version of R there are at least three `sort()` methods available: `shell`, `quick`, and `radix`. R's `sort()` function is related to `order()` and `rank()` which in turn depend on `xtfrm()`, an `internal generic primitive`. According to `?sort()` the `radix` sort method, (derived from the `data.table` package/authors), is generally considered the fastest. See: https://www.r-bloggers.com/2018/08/r-tip-consider-radix-sort/ – jubilatious1 Mar 06 '21 at 18:55
  • @jubilatious1 The point of my post was not a comparison between Raku and R. I put the R code and timings just to support my proclaimed speed expectations. – Anton Antonov Mar 06 '21 at 19:42
  • @ElizabethMattijsen Please put your comments into an answer -- then I can accept it. (Of course, I do like [the answer by @codesections](https://stackoverflow.com/a/66500447).) – Anton Antonov Mar 06 '21 at 19:50
  • @ElizabethMattijsen Thx. I had misunderstood what underpinned the determination. "you don't need escape analysis to be able to make `@a .= sort` [be in-place]" Right. The dev has asked to do an in-place sort, so go ahead and do it. "Rakudo already ... internally uses an in-place sort when it determines it can do so." is presumably the same deal of presuming devs know what they're doing, i.e. core Raku philosophy. I now realize that whether code is actually faulty due to having somehow shared a to-be-sorted data reference with another thread (i.e. it's escaped) is a separate unrelated problem. – raiph Mar 06 '21 at 23:44
  • @AntonAntonov Hi Anton, it would still be interesting to know your comparative timings if you added the parameter `, method = "radix"` to your `sort()` calls in R. Also, since you're interested in sorting in reverse order, R can do that in one call: simply add the parameter `, decreasing = TRUE` to your `sort()` calls. Thx. – jubilatious1 Mar 07 '21 at 03:48
  • @jubilatious1 The first R code I posted is supposed to resemble the Raku code. That is why I used in R `rev(sort(m))` -- it corresponds to `$m.values.sort.reverse` in Raku. But since you and @raiph requested more detailed R benchmarking, I updated my post with such code and results. – Anton Antonov Mar 07 '21 at 15:11

3 Answers


[EDIT 2021-03-06: thanks to a series of commits over the past ~day (thanks, Liz!), this slowdown is now largely fixed on HEAD; these performance gains should show up in the next monthly release. I'm leaving the answer below as an example of how to dig into this sort of issue, but the specific problems it diagnosed have largely been resolved.]

Building on @Elizabeth Mattijsen's comment: The slow performance here is mostly due to the Rakudo compiler not properly optimizing the generated code (as of 2021-03-05). As the compiler continues to improve, the (idiomatic) code you wrote above should perform much better.

However, for today we can use a few workarounds to speed up this calculation. While it's still true that Raku's performance won't be particularly competitive with R here, some profiling-driven refactoring can make this code nearly an order of magnitude faster.

Here's how we do it:

First, we start by profiling the code. If you run your script with raku --profile=<filename>, then you'll get a profile written to <filename>. By default, this will be an HTML file that allows you to view the profile in your browser. My preference, however, is to specify an .sql extension, which generates an SQL profile. I then view this profile with MoarProf, the revised profiler that Timo Paulssen is building.
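For example, assuming the benchmark script is saved as `sort-mix.raku` (a placeholder name), the two-step workflow looks like this:

```shell
# Write the profile in SQL format (the .sql extension selects it),
# then open the resulting file with the MoarProf viewer.
raku --profile=sort-mix.sql sort-mix.raku
```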

Looking at this profile shows exactly the issue that Liz mentioned: Calls that should be getting inlined are not. To fix this, let's create our own sorting function, which the JIT compiler will happily optimize:

sub my-reverse($a, $b) {
    $a.value > $b.value ?? Order::Less !! Order::More
}

Using this function (with $m0.sort(&my-reverse)) immediately shaves a good 25% off the runtime, but it's still way too high. Back to the profiler!

The next thing that jumps out at me is that we still have far too many calls to Bool. In particular, it looks like Rakudo is currently converting Ordering to a Bool. I think this is a bug and plan to look into it after posting this, but in any event, we can save Rakudo the effort:

sub my-reverse1($a, $b) {
    $a.value > $b.value ?? False !! True
}

On my machine, this cuts execution time in half again, getting us to ~28% of the original runtime of .sort({-$_.value}). This is getting decent, and would be a fine place to stop.

Let's press on, though: checking the profiler again shows that we're still devoting a very large chunk of our time to calling Bool (even though we're calling it half as often). To fix this at the moment, we'll need to drop down to NQP to compare numbers without constructing a Bool:

sub nqp-reverse($a, $b) {
    use nqp;
    nqp::isge_n($a.value, $b.value) ?? False !! True
}

This cuts our execution time in half again, and gets us to about the performance that I'd want out of Raku.

Here are the timing results I get, both for the functions I added and the ones in your question, reported in the same format you used:

.elems: 100000
.head(3): (id.195 => 80.81458886131459 id.31126 => 84.25690944480021 id.60237 => 45.63311676798485)
time of .sort(&nqp-reverse): 0.3226533
time of .sort(&my-reverse1): 0.76803384
time of .sort(&my-reverse) : 1.4643238
time of .sort({-$_.value}) : 2.6780952
time of .sort(-*.value)    : 1.8549689
time of .sort.reverse      : 2.5862973
time of .values.sort       : 2.078715
codesections

There was a deficiency in the way comparisons were made with .sort, which resulted in a lot of calls to the Mu.Bool method; in some cases these calls would make up about 50% of the CPU used. Today I changed the handling of .sort for non-standard comparisons in such a way that the call to Mu.Bool is no longer made. This resulted in the code in the question running about 2x as fast as before, although @codesections reported a 4x speed increase. This should be in the 2021.03 release.

Elizabeth Mattijsen

On a slight aside, there are some Raku idioms that can streamline a bit...

my @words = Array(1 .. 100_000).map({ 'id.' ~ $_.Str });
my $m0 = Mix(@words.map({ $_ => 100.rand() }));

to

my $m0 = .map({ $_ => 100.rand }).Mix given 'id.' X~ 1..100_000;

This employs the `given` formulation to topicalize the generated word list, together with the `X` cross metaoperator and `~` string concatenation to build the keys.
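As a small illustration of the key-building part, the cross metaoperator on its own produces the word list:

```raku
say 'id.' X~ 1..3;   # (id.1 id.2 id.3)
```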

librasteve