-1

I'm trying to multiply some probability functions as to update the probability given certain factors. I've tried several things using the pdqr and bayesmeta packages, but they all work out not the way I intend, what am I missing?

A reproducible example showing two different distributions, a and b, which I want to multiply. That is because, as you notice, b doesn't have measurements in the low values, so a probability of 0. This should be reflected in the updated distribution.

library(tidyverse)
library(pdqr)
library(bayesmeta)

#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)

#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")

#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)

The resulting PDF's are plotted like this:

enter image description here

The bayesmeta::convolve() does the same as summarizing two pdqr PDF's and seem to oddly shift the distributions to the right and make them not as high as supposed to be. Ordinarily multiplying the pdqr PDF's leaves a very low probablity overall. Using the pdqr::form_mix() seems to even the PDF's out in between, but leaving probabilies above 0 for the lower x-values.

So, I tried to gain some insight in what I wanted to do, by using the PDF's for a and b to generate probabilities for each x value and multiply that:

#multiply distributions manually
x <- c(1:10)
manual <- data.frame(x) %>%
  mutate(a = distr_a(x),
         b = distr_b(x),
         multiplied = a*b)

This indeed gives a resulting shape I am after, it however (logically) has too low probabilities: enter image description here

I would like to multiply (multiple) PDF's. What am I doing wrong? Are my statistics wrong, or am I missing a usefull function?

UPDATE: It seems I am a stats noob on this subject, but I would like to achieve something like the below distribution. Given that both situation a and b are true, I would expect the distribution te be something like the dotted line. Is that possible? enter image description here

Johan Vos
  • 65
  • 5
  • 1
    You are misunderstanding the meaning of the operations in `pdqr`. `distr_a * distr_b` is not the product of the densities, it is the density of the product of independent realizations of the random variables. Similarly, `distr_a + distr_b` and `convolve` give the density of the sum of independent realizations, not the sum of the densities. – user2554330 Aug 06 '22 at 09:00
  • @user2554330, thanks for the feedback, in your opinion: what would be a suitable `pdqr` approach? – Johan Vos Aug 06 '22 at 09:57
  • I don't know what would be suitable, because you haven't really stated what you are doing this for. But if you want the product of two densities, just do it as `prod <- function(x) distr_a(x) * distr_b(x)`. What you get is not a density function, but it is a function that is the product of two densities. – user2554330 Aug 06 '22 at 17:07

2 Answers2

1

multiplied is the correct one. One can check with log-normal distributions. The sum of two independant log-normal random variables is log-normal with µ = µ_a + µ_b and sigma² = sigma²_a + sigma²_b.

a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)

distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")

distr_ab <- form_trans(
    list(distr_a, distr_b), trans = function(x, y) x*y
)
# or: distr_ab <- distr_a * distr_b

plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")

enter image description here

Stéphane Laurent
  • 75,186
  • 15
  • 119
  • 225
  • `distr_ab <- distr_a * distr_b` is my `multiplied` which has a too low PDF – Johan Vos Aug 06 '22 at 09:53
  • But it's correct :-) – Stéphane Laurent Aug 06 '22 at 10:11
  • Could you then please explain to me, why it shows probabilities for x-values higher than 10 in my example? – Johan Vos Aug 06 '22 at 10:20
  • You have some values a=10 and b=7, and 10*7=? So why would you not have values higher than 10? You seem to be totally confused. – Stéphane Laurent Aug 06 '22 at 10:24
  • Ah I read a comment above You don't understand that the density of a product is not the product of the densities... – Stéphane Laurent Aug 06 '22 at 10:26
  • it indeed seems like I am misunderstanding some of the key underlying aspects of the updating of PDF's, I've added an intended PDF in the original question. Perhaps I am asking the wrong question at all... – Johan Vos Aug 06 '22 at 12:50
  • @JohanVos The random variable **a** has a density **p_a**, the random variable **b** has a density **p_b**. Then (assuming independence), the random variable **ab** has a density, but this density is not the product of **p_a** with **p_b**. This product is even not a density, it has no mathematical interpretation. If you want the density of **ab**, then I showed you the way. – Stéphane Laurent Aug 06 '22 at 15:13
-1

As demonstrated here:

https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/

# Example distributions
probs <- seq(0,1,length.out= 100)
prior  <- dbinom(x = 8, prob = probs, size = 10)
lik <- dnorm(x = probs, mean = .5, sd = .1)

# Multiply distributions
unstdPost <- lik * prior

# If you wanted to get an actual posterior, it must be a probability 
# distribution (integrate to 1), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)

# Plot
plot(probs,  prior, col = "black", # rescaled
     type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, lik / 15, col = "red")
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
legend("topleft", legend = c("Lik", "Prior", "Unstd Post", "Post"),
       text.col = 1:4, bty = "n")

Created on 2022-08-06 by the reprex package (v2.0.1)

socialscientist
  • 3,759
  • 5
  • 23
  • 58
  • Your standardization is wrong. The sum of density values is different from the integral. You need to multiply `sum(unstdPost)` by the difference in the x values, i.e. 0.01 in your example, which will make the posterior density 100 times higher. – user2554330 Aug 06 '22 at 09:30
  • The marginal likelihood takes the sum for discrete random variables. Values are drawn from a continuous distribution but could still be drawn from a discrete distribution if we were given them without knowledge of the underlying distribution (as in the OP's case). More useful model would presumably be continuous though. – socialscientist Aug 06 '22 at 09:49
  • In my example using `distr_a * distr_b` doesn't work out the way in your example. Could you use my distributions in your example? – Johan Vos Aug 06 '22 at 09:56
  • @socialscientist: The random variable in your posterior is `probs` not `x`, and it is continuous on the [0,1] interval. – user2554330 Aug 06 '22 at 09:58
  • Just noticed: you appear to have prior and likelihood reversed. `dbinom(8, prob = probs, size = 10)` makes sense as a likelihood, but you have named it `prior`; `dnorm(x = probs, mean = .5, sd = .1)` makes sense as an approximation to a prior, but you have named it `lik`. This doesn't matter for the posterior calculation since you just multiply them and normalize, but it will confuse people. – user2554330 Aug 06 '22 at 11:32