0

I would like to simulate Revenues Scenarios upon: price and est_p (estimated probability) from the following df:

df <- data.frame(price        = c(200, 100, 600, 20, 100),
                 est_p        = c(0.9, 0.2, 0.8, 0.5, 0.6),
                 actual_sale  = c(FALSE, TRUE, TRUE, TRUE, TRUE))

Revenue is - sum of prices where actual_sale is TRUE:

print(actual1 <- sum(df$price[df$actual_sale])) # Actual Revenue

[1] 820

I've created a function to simulate Bernoulli trials upon est_p and price values:

bernoulli <- function(df) {
        sapply(seq(nrow(df)), function(x) {
                prc <- df$price[x]
                p   <- df$est_p[x]
                sample(c(prc, 0), size = 1000, replace = T, prob = c(p, 1 - p))
                })
}

And applied it to a sample df:

set.seed(100)
distr1 <- rowSums(bernoulli(df))
quantile(distr1)

  0%  25%  50%  75% 100% 
   0  700  820  920 1020 

Looks OK, actual value = median! But when I apply the same function to increased (replicated x 1000 times) sample - df1000, actual Revenue is out of bounds of simulated values:

df1000 <- do.call("rbind", replicate(1000, df, simplify = FALSE))

print(actual2 <- sum(df1000$price[df1000$actual_sale])) 

[1] 820000

distr2 <- rowSums(bernoulli(df1000))
quantile(distr2)

    0%    25%    50%    75%   100% 
726780 744300 750050 754920 775800

Why does the actual revenue is out of the range of simulated values? Where did I make a mistake and what is the correct solution to this problem?

George Shimanovsky
  • 1,668
  • 1
  • 17
  • 15
  • Why are you changing the seed values? – deepseefan Aug 26 '19 at 00:43
  • Well just in case. Even If I wouldn't it makes no difference to the problem. – George Shimanovsky Aug 26 '19 at 00:44
  • It would, check this out then, `set.seed(100)` and run `distr1 <- rowSums(sim(df))` followed by `quantile(distr1)` and change the seed to 200 and run the above again see the output, it gives you a different output. The idea of the seed is to make a reproducible random sample. – deepseefan Aug 26 '19 at 00:47
  • The problem is revenue out of the bounds no matter set seed 100 or 200. This is what I meant. – George Shimanovsky Aug 26 '19 at 00:49
  • We'll come to that, if you fix the seed constant. – deepseefan Aug 26 '19 at 00:51
  • 1
    I've removed second set.seed for convenience and updated the output and it's not, would you please check on your side. – George Shimanovsky Aug 26 '19 at 00:55
  • Now change the `rbind` in your `do.call` to `cbind` and see if it gives you what you want. – deepseefan Aug 26 '19 at 00:56
  • It seems you didn't catch the problem. Actual Revenue for the second case is 820000 since the sample size increase 1000 times by rbind replication – George Shimanovsky Aug 26 '19 at 01:01
  • aren't we talking about `binomial distributions` (probability distribution)? Why is the actual revenue for the second case `820000`? – deepseefan Aug 26 '19 at 01:08
  • print(actual2 <- sum(df1000$price[df1000$sale])) Sale column Indicates Actual Outcome (TRUE/FALSE). If you have any thoughts on how to simulate revenue distribution upon mentioned above would you please share in the answer? – George Shimanovsky Aug 26 '19 at 01:09
  • Is it a typo or actually `Bernulli` in the title or `Bernoulli`? – deepseefan Aug 26 '19 at 03:57
  • It is. fixed, thanks – George Shimanovsky Aug 26 '19 at 07:05
  • I kind of figured out issue with your current approach. `temp <- bernoulli(df1000)`. You need to take `rowSums` and `quantile` of every 5 columns as they are one group. Do `lapply(seq(1, ncol(temp), 5), function(x) quantile(rowSums(temp[, x:(x + 4)])))` and they follow the same distribution of `quantile`. – Ronak Shah Aug 27 '19 at 03:06

1 Answers1

0

I needed a space to clarify my comment which says change rbind to cbind in your do.call. Here it is and why I said that.

set.seed(100)
df <- data.frame(price        = c(200, 100, 600, 20, 100),
                 est_p        = c(0.9, 0.2, 0.8, 0.5, 0.6),
                 actual_sale  = c(FALSE, TRUE, TRUE, TRUE, TRUE))

print(actual1 <- sum(df$price[df$actual_sale])) # Actual Revenue

[1] 820

# here is where you need to change the rbind to cbind to stay within the range 
# otherwise you're essentially changing the distribution of the data and you 
# can't compare the results 
df1000 <- do.call("cbind", replicate(1000, df, simplify = FALSE))
print(actual2 <- sum(df1000$price[df1000$actual_sale])) 
[1] 820

Here is the simulated distribution, an rbind distribution and cbind distribution for you to have a visual. As you can see the simulated and cbind are the same. The rbind produced a different distribution. The quantile() or fivenum() are drawn from the distribution. That is why you're getting a different number.

binomial_out

Hope this helps trace the reason behind why or from where the quantile() is getting the numbers.

deepseefan
  • 3,701
  • 3
  • 18
  • 31