I am still teaching some R, mainly to myself (and to my students).

Here's an implementation of the Collatz sequence in R:

f <- function(n)
{
    # construct the entire Collatz path starting from n
    if (n==1) return(1)
    if (n %% 2 == 0) return(c(n, f(n/2)))
    return(c(n, f(3*n + 1)))
}

Calling f(13) I get 13, 40, 20, 10, 5, 16, 8, 4, 2, 1

However, note that a vector is growing dynamically in size here. Such growth tends to be a recipe for inefficient code. Is there a more efficient version?

In Python I would use

def collatz(n):
    assert isinstance(n, int)
    assert n >= 1

    def __colla(n):
        while n > 1:
            yield n
            n = n // 2 if n % 2 == 0 else 3 * n + 1
        yield 1

    return list(__colla(n))
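As a quick sanity check (repeating the definition so the snippet stands alone), the Python version reproduces the sequence shown above for f(13), and one can also peek at a few sequence lengths:

```python
def collatz(n):
    assert isinstance(n, int)
    assert n >= 1

    def __colla(n):
        # yield the current value, then step to the next Collatz number
        while n > 1:
            yield n
            n = n // 2 if n % 2 == 0 else 3 * n + 1
        yield 1

    return list(__colla(n))

print(collatz(13))  # [13, 40, 20, 10, 5, 16, 8, 4, 2, 1]

# sequence lengths for the first ten integers
print([len(collatz(i)) for i in range(1, 11)])
```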

I found a way to write into vectors without specifying their dimension a priori. Therefore a solution could be

collatz <- function(n)
{
  stopifnot(n >= 1)
  # define a vector without specifying the length
  x = c()

  i = 1
  while (n > 1)
  {
    x[i] = n
    i = i + 1
    n = ifelse(n %% 2, 3*n + 1, n/2)
  }
  x[i] = 1
  # now "cut" the vector
  dim(x) = c(i)
  return(x)
}
tschm
  • [A look at R vectorization through the Collatz conjecture](http://blog.revolutionanalytics.com/2014/04/a-look-a-r-vectorization-through-the-collatz-conjecture.html) – Patrick Artner Oct 03 '18 at 20:53
  • [R and the Collatz conjecture, part 2](http://blog.revolutionanalytics.com/2014/05/r-and-the-collatz-conjecture-part-2.html) – Patrick Artner Oct 03 '18 at 20:53
  • `google R collatz efficient` - first 2 hits - should help :) got no idea about R – Patrick Artner Oct 03 '18 at 20:54
  • Unfortunately not really. I have seen those links before. They don't even attempt to explicitly construct the sequence of numbers in the conjecture; they just update a counter to get the length of the sequence(s). They run for a vector of arguments. Nevertheless, very interesting. Thank you... – tschm Oct 03 '18 at 21:04

1 Answer


I was curious to see how a C++ implementation through Rcpp would compare to your two base R approaches. Here are my results.

First, let's define a function collatz_Rcpp that returns the Hailstone sequence for a given integer n. The (non-recursive) implementation was adapted from Rosetta Code.

library(Rcpp)
cppFunction("
    std::vector<int> collatz_Rcpp(int i) {
        std::vector<int> v;
        while(true) {
            v.push_back(i);
            if (i == 1) break;
            i = (i % 2) ? (3 * i + 1) : (i / 2);
        }
        return v;
    }
")

We now run a microbenchmark analysis using both of your base R implementations and the Rcpp implementation, calculating the Hailstone sequences for the first 10000 integers.

# base R implementation
collatz_R <- function(n) {
    # construct the entire Collatz path starting from n
    if (n == 1) return(1)
    if (n %% 2 == 0) return(c(n, collatz_R(n/2)))
    return(c(n, collatz_R(3*n + 1)))
}

# "updated" base R implementation
collatz_R_updated <- function(n) {
  stopifnot(n >= 1)
  # define a vector without specifying the length
  x = c()
  i = 1
  while (n > 1) {
    x[i] = n
    i = i + 1
    n = ifelse(n %% 2, 3*n + 1, n/2)
  }
  x[i] = 1
  # now "cut" the vector
  dim(x) = c(i)
  return(x)
}
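Before benchmarking, it's worth a quick check that the recursive and the iterative algorithm really produce identical sequences. Here is a minimal sketch in Python (mirroring the logic of the two R functions above, not the R code itself):

```python
def collatz_recursive(n):
    # mirrors collatz_R: prepend n, then recurse on the next value
    if n == 1:
        return [1]
    nxt = n // 2 if n % 2 == 0 else 3 * n + 1
    return [n] + collatz_recursive(nxt)

def collatz_iterative(n):
    # mirrors collatz_R_updated: append values in a loop, then the final 1
    seq = []
    while n > 1:
        seq.append(n)
        n = 3 * n + 1 if n % 2 else n // 2
    seq.append(1)
    return seq

# both versions agree on the first 1000 integers
assert all(collatz_recursive(i) == collatz_iterative(i) for i in range(1, 1001))
```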

library(microbenchmark)
n <- 10000
res <- microbenchmark(
    baseR = sapply(1:n, collatz_R),
    baseR_updated = sapply(1:n, collatz_R_updated),
    Rcpp = sapply(1:n, collatz_Rcpp))

res
# Unit: milliseconds
#           expr        min         lq       mean     median         uq       max
#          baseR   65.68623   73.56471   81.42989   77.46592   83.87024  193.2609
#  baseR_updated 3861.99336 3997.45091 4240.30315 4122.88577 4348.97153 5463.7787
#           Rcpp   36.52132   46.06178   51.61129   49.27667   53.10080  168.9824
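To put the numbers in relation, here are the ratios of the mean timings (values copied from the table above):

```python
# mean timings in milliseconds, copied from the microbenchmark output
means = {"baseR": 81.42989, "baseR_updated": 4240.30315, "Rcpp": 51.61129}

for name, t in means.items():
    print(f"{name}: {t / means['Rcpp']:.1f}x the Rcpp mean")
# baseR: 1.6x the Rcpp mean
# baseR_updated: 82.2x the Rcpp mean
# Rcpp: 1.0x the Rcpp mean
```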

library(ggplot2)
autoplot(res)

[Plot: autoplot(res), comparing the timing distributions of the three implementations]

The (non-recursive) Rcpp implementation seems to be about 1.6 times as fast as the original (recursive) base R implementation (comparing mean timings: 51.6 ms vs. 81.4 ms). The "updated" (non-recursive) base R implementation is dramatically slower than the original (recursive) base R approach; the microbenchmark takes around 10 minutes to finish on my MacBook Air, almost entirely due to baseR_updated.

Maurits Evers
  • Wow, fascinating results. Many thanks. Now, I try to keep away from C++ as it adds yet another layer of complexity. I am very surprised by the differences between baseR and baseR_updated. I assumed baseR_updated to be faster than baseR. So baseR wins both for efficiency and elegance (over baseR_updated...) – tschm Oct 04 '18 at 04:29
  • Thanks for an interesting question @tschm. I guess the conclusion is that the recursive algorithm (`baseR`) is very fast, despite dynamically growing a vector in R. The `Rcpp` approach is not recursive, and is significantly faster than the equivalent base R approach `baseR_updated`. A recursive `Rcpp` version would probably be the winner;-) – Maurits Evers Oct 07 '18 at 10:38
  • I guess the next stop will be caching, in particular if you compute 10000 sequences as you do.... – tschm Oct 07 '18 at 17:52
  • I guess it's not trivial to cache a function in R. I could do m_f <- memoise(f), but in this case the recursive calls of f inside f are not taking advantage of caching... – tschm Oct 07 '18 at 18:13
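On the caching point in the comments above: in Python, memoising the recursive version is straightforward with functools.lru_cache, because the decorator rebinds the function name, so the recursive calls themselves hit the cache. A sketch (the Python analogue, not the R memoise solution the comment asks about):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def collatz_cached(n):
    # return the sequence as a tuple: immutable, so safe to cache and share
    if n == 1:
        return (1,)
    nxt = n // 2 if n % 2 == 0 else 3 * n + 1
    return (n,) + collatz_cached(nxt)

print(collatz_cached(13))  # (13, 40, 20, 10, 5, 16, 8, 4, 2, 1)
```

Computing the sequences for 1..10000 then reuses every shared suffix instead of recomputing it. In R, the analogous trick would be to make the recursive call itself go through the memoised binding rather than the raw function.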