
I use the R package adabag to fit boosted trees to a (large) data set (140 observations with 3,845 predictors).

I ran this method twice with the same parameters and the same data set, and each run returned a different accuracy (I defined a simple function that computes the accuracy for a given data set). Did I make a mistake, or is it usual that each fit returns a different accuracy? Is this caused by the size of the data set?

Function that returns the accuracy, given the predicted values and the true test-set values:

    err <- function(pred_d, test_d)
    {
      abs.acc <- sum(pred_d == test_d)     # number of correct predictions
      rel.acc <- abs.acc / length(test_d)  # proportion of correct predictions

      v <- c(abs.acc, rel.acc)

      return(v)
    }
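For illustration, a call with made-up labels (not my real data) would look like this:

    pred_d <- c("a", "b", "a", "a")  # hypothetical predicted class labels
    test_d <- c("a", "b", "b", "a")  # hypothetical true test-set labels
    err(pred_d, test_d)              # returns c(3, 0.75): 3 correct predictions, 75% accuracy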

New edit (9 Jan 2017): an important follow-up question to the above.

As far as I can see, I do not use any "pseudo-randomness objects" (such as generating random numbers) in my code, because I essentially fit trees (using the R package rpart) and boosted trees (using the R package adabag) to a large data set. Can you explain to me where "pseudo-randomness" enters when I execute my code?

Edit 1: A similar phenomenon also happens with trees (using the R package rpart).

Edit 2: A similar phenomenon did not happen with trees (using rpart) on the iris data set.

bjn
  • I think you have to use `set.seed` in order to get the same results. – Chirayu Chamoli Dec 24 '16 at 03:00
  • Yes, there's no reason you should expect to get the same results if you didn't set your seed. – Hack-R Dec 24 '16 at 04:37
  • @ChirayuChamoli Unfortunately, I am unfamiliar with this function. Can I place it anywhere in the code to be executed? Which value should I set (e.g. set.seed(1))? – bjn Dec 24 '16 at 04:37
  • It doesn't matter what seed you set if you're doing statistics rather than information security. You might run your model with several different seeds to check its sensitivity. You just have to set it before anything involving pseudo randomness. Most people set it at the beginning of their code. This is ubiquitous in statistics; it affects all probabilistic models and processes across all languages. – Hack-R Dec 24 '16 at 04:39
  • @Hack-R Thanks for your quick answer. Suppose somebody else opens the code on another computer (where set.seed(1) is now given at the beginning of the executed code). Does he get the same results as I do? What do I need to set so that he gets the same results as I do? – bjn Dec 24 '16 at 04:45
  • Yes, he would get the same results – Hack-R Dec 24 '16 at 04:47
  • I added a crucially important question for understanding the problem in my post. Please see above, new edit (9 Jan 2017). – bjn Jan 09 '17 at 15:26

1 Answer


There's no reason you should expect to get the same results if you didn't set your seed (with set.seed()).

It doesn't matter what seed you set if you're doing statistics rather than information security. You might run your model with several different seeds to check its sensitivity. You just have to set it before anything involving pseudo randomness. Most people set it at the beginning of their code.

This is ubiquitous in statistics; it affects all probabilistic models and processes across all languages.
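For example, here is a minimal sketch with adabag (iris is used only as a stand-in for your data, and mfinal = 10 is an arbitrary choice):

    library(adabag)

    set.seed(1)                                    # fix the RNG state before any resampling happens
    fit1 <- boosting(Species ~ ., data = iris, mfinal = 10)

    set.seed(1)                                    # reset to the same state before the second fit
    fit2 <- boosting(Species ~ ., data = iris, mfinal = 10)

    p1 <- predict(fit1, newdata = iris)$class
    p2 <- predict(fit2, newdata = iris)$class
    identical(p1, p2)                              # TRUE: same seed, same bootstrap samples, same fit

Without the two set.seed() calls the fits will generally differ, because boosting() draws a (weighted) bootstrap sample of the rows at each iteration.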

Note that in the case of information security it's important to have a (pseudo) random seed which cannot be easily guessed by brute force attacks, because (in a nutshell) knowing a seed value used internally by a security program paves the way for it to be hacked. In science and statistics it's the opposite - you and anyone you share your code/research with should be aware of the seed to ensure reproducibility.

https://en.wikipedia.org/wiki/Random_seed

http://www.grasshopper3d.com/forum/topics/what-are-random-seed-values

Hack-R
  • Thank you very much, it clarifies a lot. Just to be completely sure (as a lot of work depends on it): if the other guy on another computer executes the same code with the same seed value (e.g. set.seed(1)), does he get the same results as I do? – bjn Dec 24 '16 at 05:06
  • After some thinking, it is unclear to me where I use pseudo-randomness in my code. Essentially I train two methods, boosted trees using adabag and trees using rpart, on some data set. Where do I use pseudo-randomness? – bjn Jan 03 '17 at 14:46
  • Yeah, but you should get directionally similar results, unless you're nowhere near converged. I like to plot `y~yhat` to check my fit, but something like `yhat1~yhat2` can give you an indication of model stability. – geneorama Jan 03 '17 at 21:50
  • @geneorama I'm not sure I understand you correctly. (Sorry, I'm not a computer scientist.) Pseudo-randomness is implicitly used somewhere when I execute my code, although my programming code does not contain anything like that. Why is this used? Does it have something to do with the very large size of my data set (as some kind of "stochastic" approximation)? – bjn Jan 07 '17 at 15:12
  • Please note that I added a new follow-up question about pseudo-randomness to my initial post (basically, "Where does pseudo-randomness enter?"). Let me know if something is unclear. – bjn Jan 10 '17 at 18:21
  • @bjn I don't understand your reply. My point was that if you're trying to predict something real, and run the same model twice, you should get a similar result no matter what's happening with your random noise, e.g. your "most likely to fail" record shouldn't suddenly become the "most likely to succeed" record. – geneorama Jan 10 '17 at 20:04
  • @geneorama It is somewhat unclear to me where the noise enters the data if I run a fitting method twice on exactly the same data and get two different test accuracies. In this context I do not see where I use random objects (such as noise). – bjn Feb 03 '17 at 14:46
  • @bjn If you run this example repeatedly: `lm(y~x, data.frame(x=1:1000, y=rnorm(1000)))` the coefficients are around zero, but because of the noise it's not exactly zero. My first point was that the answer to your problem should be stable but not identical when repeated. With bootstrap aggregating (bagging) the data is resampled, *resampled* as in *randomly* resampled. That's a primary source for the randomness, maybe not the only source I'm not sure (and as a practitioner I don't care much). – geneorama Feb 07 '17 at 18:55
  • @geneorama Now I understand where the noise enters in the fitting of boosted trees with the R package adabag. Thank you. I also reread the paper "adabag: An R Package for Classification with Boosting and Bagging"; in the boosting algorithm a bootstrap sample is drawn using the weight of each observation at that iteration. If I understand it correctly, this means that at each iteration a bootstrap sample, drawn according to the weights, is used instead of the actual data. Is that correct? (See also the sketch after these comments.) – bjn Feb 07 '17 at 19:10
  • @geneorama Please see also my post http://stats.stackexchange.com/questions/260512/using-boostrap-for-boosted-trees-r-package-adabag – bjn Feb 07 '17 at 19:12
  • I just want to point out that any time one executes a piece of code, one needs to set the seed at the beginning. – bjn Feb 07 '17 at 19:14
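To illustrate the resampling point made in the comments above, here is a minimal sketch (the weights w are made up; at the first boosting iteration they would be uniform) showing that the bootstrap draw differs on every run unless the seed is fixed:

    n <- 10
    w <- rep(1/n, n)                               # hypothetical observation weights

    # Two unseeded draws: almost surely different bootstrap samples
    sample(1:n, n, replace = TRUE, prob = w)
    sample(1:n, n, replace = TRUE, prob = w)

    # Two seeded draws: identical bootstrap samples, hence identical fits downstream
    set.seed(1); s1 <- sample(1:n, n, replace = TRUE, prob = w)
    set.seed(1); s2 <- sample(1:n, n, replace = TRUE, prob = w)
    identical(s1, s2)                              # TRUE

This is why two runs of the same boosting code on the same data can return different accuracies unless the seed is fixed first.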