5

We are facing a weird situation in our code when using R's runif and setting seed with set.seed with the kind = NULL option (which resolves, unless I am mistaken, to kind = "default"; the default being "Mersenne-Twister").

We set the seed using (8 digit) unique IDs generated by an upstream system, before calling runif:

seeds = c(
  "86548915", "86551615", "86566163", "86577411", "86584144", 
  "86584272", "86620568", "86724613", "86756002", "86768593", "86772411", 
  "86781516", "86794389", "86805854", "86814600", "86835092", "86874179", 
  "86876466", "86901193", "86987847", "86988080")

random_values = sapply(seeds, function(x) {
  set.seed(x)
  y = runif(1, 17, 26)
  return(y)
})

This gives values that are extremely bunched together.

> summary(random_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.13   25.36   25.66   25.58   25.83   25.94 

This behaviour of runif goes away when we use kind = "Knuth-TAOCP-2002", and we get values that appear to be much more evenly spread out.

random_values = sapply(seeds, function(x) {
  set.seed(x, kind = "Knuth-TAOCP-2002")
  y = runif(1, 17, 26)
  return(y)
})

Output omitted.


The most interesting thing here is that this does not happen on Windows -- only happens on Ubuntu (sessionInfo output for Ubuntu & Windows below).

Windows output:

> seeds = c(
+   "86548915", "86551615", "86566163", "86577411", "86584144", 
+   "86584272", "86620568", "86724613", "86756002", "86768593", "86772411", 
+   "86781516", "86794389", "86805854", "86814600", "86835092", "86874179", 
+   "86876466", "86901193", "86987847", "86988080")
> 
> random_values = sapply(seeds, function(x) {
+   set.seed(x)
+   y = runif(1, 17, 26)
+   return(y)
+ })
> 
> summary(random_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.32   20.14   23.00   22.17   24.07   25.90 

Can someone help understand what is going on?

Ubuntu

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RMySQL_0.10.8               DBI_0.6-1                  
 [3] jsonlite_1.4                tidyjson_0.2.2             
 [5] optiRum_0.37.3              lubridate_1.6.0            
 [7] httr_1.2.1                  gdata_2.18.0               
 [9] XLConnect_0.2-12            XLConnectJars_0.2-12       
[11] data.table_1.10.4           stringr_1.2.0              
[13] readxl_1.0.0                xlsx_0.5.7                 
[15] xlsxjars_0.6.1              rJava_0.9-8                
[17] sqldf_0.4-10                RSQLite_1.1-2              
[19] gsubfn_0.6-6                proto_1.0.0                
[21] dplyr_0.5.0                 purrr_0.2.4                
[23] readr_1.1.1                 tidyr_0.6.3                
[25] tibble_1.3.0                tidyverse_1.1.1            
[27] rBayesianOptimization_1.1.0 xgboost_0.6-4              
[29] MLmetrics_1.1.1             caret_6.0-76               
[31] ROCR_1.0-7                  gplots_3.0.1               
[33] effects_3.1-2               pROC_1.10.0                
[35] pscl_1.4.9                  lattice_0.20-35            
[37] MASS_7.3-47                 ggplot2_2.2.1              

loaded via a namespace (and not attached):
[1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0          modelr_0.1.0      
 [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0       cellranger_1.1.0  
 [9] quantreg_5.33      chron_2.3-50       digest_0.6.10      rvest_0.3.2       
[13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10      plyr_1.8.4        
[17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2        SparseM_1.77      
[21] haven_1.0.0        scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
[25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12        lazyeval_0.2.0    
[29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5       memoise_1.0.0     
[33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1         foreign_0.8-69    
[37] tools_3.4.0        hms_0.3            munsell_0.4.3      compiler_3.4.0    
[41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0         nloptr_1.0.4      
[45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0        gtable_0.2.0      
[49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0          
[53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5      Rcpp_0.12.11  

Windows

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
[4] LC_NUMERIC=C                   LC_TIME=English_India.1252    

attached base packages:
[1] graphics  grDevices utils     datasets  grid      stats     methods   base     

other attached packages:
 [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5        eulerr_1.1.0         VennDiagram_1.6.17  
 [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3         xml2_1.0.0           httr_1.3.0          
[11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2   htmltools_0.3.6      urltools_1.6.0      
[16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5         shiny_1.0.5          RODBC_1.3-14        
[21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2        gsubfn_0.6-6         proto_1.0.0         
[26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12     XLConnectJars_0.2-12 data.table_1.10.4   
[31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8          readxl_0.1.1         googlesheets_0.2.1  
[36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9        RPostgreSQL_0.4-1    DBI_0.5-1           
[41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1          tidyr_0.7.0          tibble_1.3.3        
[46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0     

loaded via a namespace (and not attached):
 [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0      cellranger_1.1.0     yaml_2.1.14         
 [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1           chron_2.3-48         digest_0.6.12.1     
[11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4           pkgconfig_2.0.1      xtable_1.8-2        
[16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0        tools_3.3.2          hms_0.3             
[21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1          RCurl_1.95-4.8       labeling_0.3        
[26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0         reshape2_1.4.2       R6_2.2.0            
[31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2        Rcpp_0.12.12.1      
tchakravarty
  • 10,736
  • 12
  • 72
  • 116
  • 2
    What is the "upstream process" that generates the seeds. This seems to be an artefact of the particular set of seeds you supply, and is not generally true for a random set of seeds of similar magnitude – dww Nov 02 '17 at 16:20
  • If I run your code on Windows I get the same results. I used `3.4.1` on Windows 10. There usually isn't a difference in random number generation between platforms. Unless i'm misunderstanding exactly what you mean when you say "that this does not happen" – MrFlick Nov 02 '17 at 16:28
  • @dww Sure, we tested them on other 8-digit numbers as well & we could not replicate. However, these are honest-to-goodness numbers generated by a non-adversarial system that has no conception of these numbers being used for anything other than a unique key for an entity -- these are not a specially constructed edge case. Would be good to know what seeds will and will not work, and why. – tchakravarty Nov 02 '17 at 17:30
  • @MrFlick I can give you the `sessionInfo` on my Windows 10 machine running R 3.3.2 where the random numbers do not, as in this question, appear to be non-randomly distributed. – tchakravarty Nov 02 '17 at 17:30
  • Also, bizarre that this got downvoted. – tchakravarty Nov 02 '17 at 17:31
  • @tchakravarty yes. supply the session info that gives different results on windows and show those results. With the same seed, the results should be reproducible on all versions of R. – MrFlick Nov 02 '17 at 18:54
  • @MrFlick Requested `sessionInfo` & code output has been added. – tchakravarty Nov 03 '17 at 04:49
  • I get your results when I run it on Windows. Interestingly, when I add 1 to all of the numbers in `seed` the effect vanishes. I agree with @dww that your unspecified "upstream process" is somehow responsible. My guess is that they were themselves constructed by use of the Mersenne Twister. Please post the code that generates them. – John Coleman Nov 03 '17 at 18:10
  • @JohnColeman I have clarified above that these are form IDs generated in a sequence (not randomly) by an application ("upstream system") which calls my API within which the R code gets called. The form IDs for which the API gets called depends on the user inputs to the form, which is definitely not a Mersenne Twister. – tchakravarty Nov 03 '17 at 18:30
  • The numbers in your seed list are not sequential. Without understanding a bit more about where these numbers come from your question (while interesting) seems impossible to answer. – John Coleman Nov 03 '17 at 18:51

3 Answers3

3

NOTE: This answer summarizes elements of a discussion of this question that took place on the R-devel mailing list. I've merely tried to capture and summarize the ideas originally articulated there.

Despite your assurances that these numbers are not a specially constructed edge case, they have every appearance of being just that. Here is the original sequence plus code to check the distribution of values produced:

seeds = c(
    86548915, 86551615, 86566163, 86577411, 86584144, 86584272,
    86620568, 86724613, 86756002, 86768593, 86772411, 86781516,
    86794389, 86805854, 86814600, 86835092, 86874179, 86876466,
    86901193, 86987847, 86988080)
checkit <- function(seeds) {
    sapply(seeds, function(x) {
        set.seed(x)
        y = runif(1, 17, 26)
        return(y)
    })}

The original sequence shows surprisingly little variation, as noted:

  summary(checkit(seeds+0))
  ## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  ##25.13   25.36   25.66   25.58   25.83   25.94 

There does appear to be something special about the original sequence, since minimal modifications to it do not produce the same surprising result:

summary(checkit(seeds+1))
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 17.18   19.65   22.75   22.02   24.37   25.79

summary(checkit(seeds-1))
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##17.15   18.44   19.92   20.77   22.97   25.95 

Out of the all the seeds in the range spanned by the original sequence, the expected number produce values in the observed range:

possible.seeds <- min(seeds):max(seeds)

s25 <- Filter(function(s){
    set.seed(s)
    x <- runif(1,17,26)
    x > 25.12 & x < 25.95},
    possible.seeds)

length(s25)/length(possible.seeds)
##[1] 0.09175801

And yet, all the values in the original sequence are in this subset (of course we knew this already...).

table(seeds %in% s25)

##TRUE 
##  21 

All of which points to the likely possibility that the original sequence was in fact a (perhaps unintentionally) specially constructed edge case.

Ista
  • 10,139
  • 2
  • 37
  • 38
2

As further evidence that your sequence is an edge-case, you can focus on the range of the putatively random values was constructed. The 17 and 26 are a bit of a distraction. Repeating your experiment with uniform on 0 and 1 generates something equally as unlikely:

f <- function(x) {
  set.seed(x)
  runif(1)
}

 check_range <-function(seeds){
   vals <- sapply(seeds,f)
   max(vals)-min(vals)
}

When run against your seeds:

> check_range(seeds)
[1] 0.09026112

A reasonable model for check_range(seeds) when run on 21 random seeds is that it is the sample range of a random sample of size 21 drawn U(0,1). Its theoretical density is given by:

f <- function(x){420*x^19*(1-x)}

We can use this to calculate the probability of observing a range of 0.09 or smaller:

> integrate(f,0,0.09)
2.334272e-20 with absolute error < 2.6e-34

As a check that it is reasonable to model the sample range when seeding the Mersenne Twister like that, you can do the following experiment:

ranges <- replicate(1000,check_range(sample(8548915:86988080,21)))
x <- seq(0,1,0.01)
y <- f(x)
hist(ranges,freq = FALSE,xlim =c(0,1))
points(x,y,type = "l")
abline(v=0.09)

Output:

enter image description here

The density histogram follows the theoretical density reasonably well. The set of 21 seeds from your question represents an extreme outlier. It is very unlikely to be due to chance and is also unlikely to be due to some underlying flaw in the Mersenne Twister. The most likely explanation is that the Mersenne Twister itself was involved in producing those 21 values (but not of course in the naïve way of simply using sample() to draw 21 values).

John Coleman
  • 51,337
  • 7
  • 54
  • 119
1

When you're using Mersenne Twister with a single seed, it's reasonable to assume that the generated values are approximately independent and identically distributed. Unfortunately, there are no guarantees about the generated values from two streams started at different seeds. See, for example, this SC thread.

In your situation, I would recommend either using one of the seed selection strategies suggested in the SC thread, or else switching to a PRNG with better guarantees about parallel streams. One option is L'Ecuyer's "RngStreams" generator:

set.seed(0, kind = "L'Ecuyer-CMRG")

Even with that PRNG, I don't know if it still holds that you can seed the PRNG with arbitrary seeds and get roughly independent streams.

As far as the difference between Ubuntu and Windows, it's possible that one of these systems is using a 32-bit generator, and the other is using 64-bit.

Patrick Perry
  • 1,422
  • 8
  • 17