1

I want to understand why I am not getting a probability distribution when I use a simulation from a random normal distribution:

library(tidyverse)
df <- mtcars # data

df$sd <- sd(df$mpg) # standard deviation of the sample

set.seed(123)
f <- function(n1, s1, n2, s2){
  mean(rnorm(10000, n1, s1) < rnorm(10000, n2, s2)) # function for probability distribution
  
}

g <- Vectorize(f, c("n1", "s1", "n2", "s2")) 
set.seed(123)
res <- outer(df$mpg, df$sd, df$mpg, df$sd, FUN = g)
dimnames(res) <- list(row.names(df), row.names(df))
res <- data.frame(res)
res <- tibble::rownames_to_column(res, 'p1')

datalong_2 <- tidyr::gather(res, 'p2', 'value', 2:33) # output

I did this simulation but for some reason, I am not getting an actual probability distribution, my goal is to evaluate the probability of a car has less mpg than another car. But the sum of the probability does not add to one. I expect that this can be added to one or lower given that a tight might happens.

For example, the probability that Mazda Rx4 has a lower mpg than Mazda Rx4 wag is 0.5094 while the probability that Mazda Rx4 wag has a lower mpg than Mazda Rx4 is 0.5029, the sum of this probability is 1.0123. How can I change this code to get an actual probability distribution of one car has lower mpg than another car?

Pastor Soto
  • 336
  • 2
  • 14

1 Answers1

1

Unless you absolutely have to run simulations, you can use the pnorm() function to calculate the probabilities precisely.

We assume that X~N(u1,s1) and Y~N(u2,s2) where s1 and s2 are variances.

Also we know that P(X<Y) = P(X-Y<0), where X-Y ~ N(u1-u2,s1+s2). From this, we can calculate the probabilities precisely:

df <- mtcars # data
df$sd <- sd(df$mpg) # standard deviation of the sample

f <- function(n1, n2){
  pnorm(0, mean = n1 - n2, sd = sqrt(2*df$sd^2))
}

res <- outer(X = df$mpg, Y = df$mpg, FUN = f)
dimnames(res) <- list(row.names(df), row.names(df))
res <- data.frame(res)
res <- tibble::rownames_to_column(res, 'p1')

datalong_2 <- tidyr::gather(res, 'p2', 'value', 2:33) # output

> datalong_2
                     p1                p2      value
1             Mazda RX4         Mazda.RX4 0.50000000
2         Mazda RX4 Wag         Mazda.RX4 0.50000000
3            Datsun 710         Mazda.RX4 0.41637203
4        Hornet 4 Drive         Mazda.RX4 0.48128464
5     Hornet Sportabout         Mazda.RX4 0.60636049
..                   ..                ..         ..

Also, I think your main problem was in the function outer(), which requires 2 inputs X and Y. It worked for me once I changed it.



Edits 2 & 3:

df1 <- mtcars; df1$rownames = rownames(df1)
df2 <- mtcars; df2$rownames = rownames(df2)
df2$mpg = df2$mpg + rnorm(nrow(df2),0,3)
data = rbind(df1, df2)


df = ddply(data,~rownames,summarise,mean=mean(mpg),sd=sd(mpg))
df = rbind(df, c("car1",-1.02, 2.66))
df = rbind(df, c("car2",0.13, 0.06))
df$mean <- as.numeric(df$mean)
df$sd <- as.numeric(df$sd)

f <- function(x, y){
  n1 = df$mean[x]; n2 = df$mean[y]; sd1 = df$sd[x]; sd2 = df$sd[y]
  pnorm(0, mean = n1 - n2, sd = sqrt(sd1^2 + sd2^2))
}

res <- outer(X = 1:nrow(df), Y = 1:nrow(df), f)
dimnames(res) <- list(df$rownames, df$rownames)
res <- data.frame(res)
res <- tibble::rownames_to_column(res, 'p1')

datalong_2 <- tidyr::gather(res, 'p2', 'value', -1) # output

subset(datalong_2, p1 %in% c("car1","car2") & p2 %in% c("car1","car2"))

> subset(datalong_2, p1 %in% c("car1","car2") & p2 %in% c("car1","car2"))
       p1   p2     value
1121 car1 car1 0.5000000
1122 car2 car1 0.3327904
1155 car1 car2 0.6672096
1156 car2 car2 0.5000000
VitaminB16
  • 1,174
  • 1
  • 3
  • 17
  • 1
    Thanks. Can you explain to me your formula ``sqrt(2*df$sd^2)``? I don't understand why is different from the variance formula – Pastor Soto May 03 '21 at 17:23
  • 1
    In `X-Y ~ N(u1-u2,s1+s2)`, `s1` and `s2` are variances, so the standard deviation of `X-Y` is `sqrt(s1+s2)`. The standard deviation for the set is `df$sd` and it is a single number, so `s1`=`s2`. Then the variance of `X-Y` is `2*df$sd^2`, meaning that SD is the square root of that. – VitaminB16 May 03 '21 at 17:26
  • I have a problem. My real dataset actually has a different standard deviation for each car. Is there a way in which I can make work out this problem with different values of the standard deviation for each car? – Pastor Soto May 03 '21 at 19:09
  • So you have several observations of `mpg` for each car? – VitaminB16 May 03 '21 at 19:15
  • 1
    Yes. I only use mtcars for simplicity. These are scores of golf players and I have data for players each year. So I have the mean score and the standard deviation of the score of each player. We can think that we have several mpg measures and I get the mean and the standard deviation of each car – Pastor Soto May 03 '21 at 19:19
  • Thank you!! I appreciate your help. I have one last problem that I am not sure how to solve. ``df = rbind(df, c("car1",-1.02, 2.66)); df = rbind(df, c("car2",0.13, 0.06)); df$mean <- as.numeric(df$mean); df$sd <- as.numeric(df$sd)`` I add two observations, the problem is that car 1 has 0.62 of having lowe mean than car 2 but car 2 has 0 probability of having lower mpg than car 1. Any idea why? – Pastor Soto May 03 '21 at 21:26
  • I'm not sure what you've done, but when I did it, everything is okay. See another edit :) – VitaminB16 May 03 '21 at 21:38
  • Yes!! Car 2 has a probability of having a lower mpg than Car 1 of 3.8e-42 which is almost 0. While Car1 has a probability of having a lower mpg than Car2 of 6.2e-01, which is 0.62 – Pastor Soto May 03 '21 at 21:42
  • 1
    My bad, I found the mistake. From the beginning, the right thing to do was to pass just the indices `X=1:nrow(df)` and `Y=1:nrow(df)` into `f` and do all the operations inside the function `f`. Edited the answer. – VitaminB16 May 03 '21 at 22:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231904/discussion-between-pastor-soto-and-vitaminb16). – Pastor Soto May 04 '21 at 02:20