is there a smart way to identify all functions that use .Random.seed (the random number generator state within R) at any point in an R script?

use case: we have a dataset that changes constantly, both the records [rows] and the information [columns] - we add new records often, but we also update information in certain columns. so the dataset is constantly in flux. we fill in some missing data with an imputation, which requires random number generation with the sample() function. so whenever we add a new row or update any information in the column, the randomly imputed numbers all change -- which is expected. we use set.seed() at the start of each random imputation, so if a column changes but zero rows change, the other randomly-generated columns are not affected.
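
For concreteness, the seeded imputation pattern described above might look something like the sketch below. The helper name `impute_column`, the seed value, and the choice to draw replacements from the observed values of the same column are all assumptions made for illustration; the sketch also assumes each column has at least one observed value.

```r
# Hypothetical helper: fill the NAs in one column with values drawn from
# that column's observed entries, after fixing the seed for the column.
impute_column = function (x, seed) {
    set.seed(seed)
    observed_idx = which(! is.na(x))
    missing_idx = which(is.na(x))
    # Draw positions with sample.int() rather than calling sample() on the
    # values, which treats a single number n as 1:n.
    # Assumption: at least one value is observed.
    draws = sample.int(length(observed_idx), length(missing_idx), replace = TRUE)
    x[missing_idx] = x[observed_idx[draws]]
    x
}

income = c(10, NA, 30, NA, 50)
first = impute_column(income, seed = 1)
second = impute_column(income, seed = 1)
```

With the same input and the same seed, `first` and `second` are identical. If any other function consumed random numbers between `set.seed()` and the draw, that guarantee would no longer hold, which is the reason for the question.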

i'm under the impression that the only function within our entire codebase that ever touches a random seed is the sample() function, but i would like to verify this somehow?

edit: even something that prints a function call whenever the random number state gets touched would be helpful, the same way debug() comes to life whenever the debugged function gets triggered? for our purposes, it is pretty safe to assume that if we run our script once for dynamic evaluation and no other random functions get triggered, then we are safe.

thanks

Anthony Damico
  • I have to guess that the lazy person who downvoted you thinks you may be able to simplify your question further. Unfortunately it's still possible to downvote a question without giving a clue about the reasons why. Good luck! – Rodrigo Apr 26 '17 at 15:48
  • No, there’s fundamentally no way. Since R is dynamically evaluated, you **cannot** write a static analyser to check for this comprehensively. Heuristics might get you close, but building one for this will be quite difficult. I’d say you’re out of luck. If you want to ensure that nothing touches the state of your random number generator, don’t rely on R’s, write a small C++ function that uses a generator based on the `<random>` standard header, and import that function in R. – Konrad Rudolph Apr 26 '17 at 15:51
  • thanks @KonradRudolph i added one edit but sorry if i'm misunderstanding something.. – Anthony Damico Apr 26 '17 at 15:57
  • It's not user-friendly, but the only thing I can think of is to check `.Random.seed` throughout your script (save the results in a matrix or something) and look for when it changes. As for a code scanner/function checker, I agree with Konrad that it's probably impossible. Too easy to come up with pathological examples. – Gregor Thomas Apr 26 '17 at 16:05
  • @AnthonyDamico: I noticed that you spent most of your rep. If you ever need a question to be "bountied", I'd be happy to use some of my rep to give your questions added notice. I suspect you know my "real name". – IRTFM May 02 '17 at 23:55
  • @42- haha thanks david :) btw, check out my next project.. `devtools::install_github("ajdamico/lodown");library(lodown);?lodown` – Anthony Damico May 03 '17 at 02:54
  • I noticed a comment in the example for download of SEER data that the catalog shows "nothing meaningful". Did you get a valid user name and PWD? They require a signed commitment agreement. They need to be renewed for each release cycle (which I've done most years since 2002.) – IRTFM May 03 '17 at 03:23
  • hi @42- , seer is a single file download so the `get_catalog` step serves no purpose :) – Anthony Damico May 03 '17 at 14:05
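
The snapshot idea from Gregor Thomas's comment above can be sketched roughly as follows. The helper name `rng_touched` is made up for illustration. It only detects code that leaves `.Random.seed` in a different state, so a pure read, or a `set.seed()` call that happens to restore the identical state, goes unnoticed.

```r
# Hypothetical helper: evaluate an expression and report whether the global
# RNG state differs afterwards.
rng_touched = function (expr) {
    get_seed = function () {
        if (exists('.Random.seed', envir = globalenv(), inherits = FALSE)) {
            get('.Random.seed', envir = globalenv(), inherits = FALSE)
        } else {
            NULL  # no random numbers have been generated in this session yet
        }
    }

    before = get_seed()
    force(expr)  # arguments are lazy, so the expression runs only here
    ! identical(before, get_seed())
}

uses_rng = rng_touched(sample(5))
no_rng = rng_touched(sum(1:5))
```

Wrapping each step of a script in such a check narrows down where the state changes, but unlike the active binding approach in the answer below it does not say which function was responsible.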

1 Answer

Notwithstanding my comment, here’s a brute force way of checking this:

rm(.Random.seed) # if it already exists
makeActiveBinding('.Random.seed',
                  function () stop('Something touched my seed', call. = FALSE),
                  globalenv())

This will make .Random.seed into an active binding that throws an error when it’s touched.
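
As a rough usage sketch (the `tryCatch` wrapper and the variable names `total` and `outcome` are only there for illustration), deterministic code keeps working while the first call that needs the RNG state fails:

```r
# Remove any existing seed first; the name must be free before binding it.
if (exists('.Random.seed', envir = globalenv(), inherits = FALSE)) {
    rm('.Random.seed', envir = globalenv())
}
makeActiveBinding('.Random.seed',
                  function () stop('Something touched my seed', call. = FALSE),
                  globalenv())

# Code that needs no random numbers runs as usual ...
total = sum(1:10)

# ... but anything that reads the RNG state triggers the error, which is
# captured here as a string instead of aborting the script.
outcome = tryCatch(sample(5), error = function (e) conditionMessage(e))

# Remove the trap so that the session can generate random numbers again.
rm('.Random.seed', envir = globalenv())
```

Without the `tryCatch`, the script would stop at `sample(5)`, and `traceback()` would then show which call was responsible.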

This works but it’s very disruptive. Here’s a gentler variant. It has a few interesting features:

  • It allows enabling and disabling debugging of .Random.seed
  • It supports getting and setting the seed
  • It logs the call but doesn’t stop execution
  • It maintains a “whitelist” of top-level calls that shouldn’t be logged

With this you can write the following code, for instance:

# Ignore calls coming from sample.int
> debug_random_seed(ignore = sample.int)

> sample(5)
Getting .Random.seed
Called from sample(5)
Setting .Random.seed
Called from sample(5)
[1] 3 5 4 1 2

> sample.int(5)
[1] 5 1 2 4 3

> undebug_random_seed()

> sample(5)
[1] 2 1 5 3 4

Here is the implementation in all its glory:

debug_random_seed = local({
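    # `ignore` is a function or a list of functions whose calls should not
    # be logged; the empty default logs every call.
    function (ignore = list()) {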
        seed_scope = parent.env(environment())

        if (is.function(ignore)) ignore = list(ignore)

        if (exists('.Random.seed', globalenv())) {
            if (bindingIsActive('.Random.seed', globalenv())) {
                warning('.Random.seed is already being debugged')
                return(invisible())
            }
        } else {
            set.seed(NULL)
        }

        # Save existing seed before deleting
        assign('random_seed', .Random.seed, seed_scope)
        rm(.Random.seed, envir = globalenv())

        debug_seed = function (new_value) {
            if (sys.nframe() > 1 &&
                ! any(vapply(ignore, identical, logical(1), sys.function(1)))
            ) {
                if (missing(new_value)) {
                    message('Getting .Random.seed')
                } else {
                    message('Setting .Random.seed')
                }
                message('Called from ', deparse(sys.call(1)))
            }

            if (! missing(new_value)) {
                assign('random_seed', new_value, seed_scope)
            }

            random_seed
        }

        makeActiveBinding('.Random.seed', debug_seed, globalenv())
    }
})

undebug_random_seed = function () {
    if (! (exists('.Random.seed', globalenv()) &&
           bindingIsActive('.Random.seed', globalenv()))) {
        warning('.Random.seed is not being debugged')
        return(invisible())
    }

    seed = suppressMessages(.Random.seed)
    rm('.Random.seed', envir = globalenv())
    assign('.Random.seed', seed, globalenv())
}

Some notes about the code:

  • The debug_random_seed function is defined inside its own private environment. This environment is designated by seed_scope in the code. This prevents leaking the private random_seed variable into the global environment.
  • The function defensively checks whether debugging is already enabled. Overkill maybe.
  • Debug information is only printed when the seed is accessed within a function call. If the user inspects .Random.seed directly on the R console, no logging occurs.
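
The two mechanisms the notes rely on, a private environment created by `local()` and a binding function that distinguishes reads from writes via `missing()`, can be seen in isolation in the following sketch. All names here (`tracked`, `store`, `value`, `access_log`) are invented for the example, and it uses `<<-` where the implementation above uses `assign()`.

```r
tracked = local({
    state = 0                 # private, like random_seed in the answer
    access_log = character()  # private record of reads and writes

    function (new_value) {
        if (missing(new_value)) {
            access_log <<- c(access_log, 'get')
        } else {
            access_log <<- c(access_log, 'set')
            state <<- new_value
        }
        state
    }
})

store = new.env()
makeActiveBinding('value', tracked, store)

assign('value', 42, envir = store)     # invokes tracked(42)
current = get('value', envir = store)  # invokes tracked()

# The private variables are reachable only through the function's enclosure.
private = environment(tracked)
```

Here `private$access_log` records both accesses, while neither `state` nor `access_log` is created in the environment that holds `tracked`.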
Konrad Rudolph
  • this is nearly perfect. is there any way to get the `function() stop()` to ignore the `sample()` function? i think an ugly solution like `if( !any( unlist( lapply( c( "sample" , "runif" ) , function( w ) grepl( w , paste( as.character( sys.calls() ) , collapse = "" ) ) ) ) ) )` would work there to skip those two functions, but maybe there's a cleaner approach – Anthony Damico Apr 26 '17 at 16:39
  • @AnthonyDamico It works in a pinch. I’d prefer to compare actual *call* symbols rather than similar-looking strings but I’m unfortunately on mobile right now so I can’t write an example. – Konrad Rudolph Apr 26 '17 at 16:44
  • @AnthonyDamico Apparently I have nothing better to do. ;-) Check the new code in the answer. This should satisfy your every dream. – Konrad Rudolph Apr 27 '17 at 10:10
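
The difference between matching call symbols and matching strings, as discussed in the comments above, could look roughly like this. `called_from_any` is a hypothetical helper, not part of the answer's code, and `my_sample_wrapper` is an invented function name. The helper only inspects calls whose head is a plain symbol, so namespaced calls such as `base::sample(5)` are deliberately not matched in this sketch.

```r
# Hypothetical helper: TRUE if any call in `calls` invokes one of the
# functions named in `names`, judged by the call's head symbol.
called_from_any = function (calls, names) {
    heads = vapply(as.list(calls), function (call) {
        head = call[[1L]]
        if (is.symbol(head)) as.character(head) else NA_character_
    }, character(1L))
    any(heads %in% names)
}

# A stand-in for the result of sys.calls(): neither call is sample() or
# runif(), but one function name contains the substring "sample".
stack = list(quote(my_sample_wrapper(x)), quote(mean(x)))

string_match = grepl('sample', paste(as.character(stack), collapse = ''))
symbol_match = called_from_any(stack, c('sample', 'runif'))
```

The string search reports a match because of `my_sample_wrapper`, while the symbol comparison does not. In real use the first argument would be `sys.calls()`, evaluated inside the function doing the check.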