
I would like to create a joint probability distribution by combining two dataframes. Each dataframe contains data drawn from the same population, but the data is not matched. For the sake of providing workable code, imagine that the data is as follows:

v1 <- data.frame(rnorm(100, 0, 3))

v2 <- data.frame(rnorm(30, 10, 20))

In reality I have survey data and simulation data that do not follow a pre-set probability distribution. I am looking for a solution that can combine two vectors of different lengths to create a joint probability distribution.

Dataset v1 represents the distribution of financial returns that can be earned by installing solar panels.

Dataset v2 represents the financial return threshold for households interested in installing solar. A household will only install solar if it lives in a home whose financial return would meet the threshold it has set.

Given these two datasets, I'd like to use the joint probability distribution to estimate the likely proportion of households that will adopt and install solar panels.

I've considered running a Monte Carlo exercise where I would randomly draw from v1 and match it with a draw from v2. I would repeat the process 1000 times and count how many homes would have achieved a return greater than their threshold.

library(tidyverse)
set.seed(1234)

monte <- NULL

# Randomly pair one draw from v1 (return) with one draw from v2 (threshold), 1000 times
for (i in 1:1000) {
  draw1 <- sample_n(v1, 1)
  draw2 <- sample_n(v2, 1)
  monte <- rbind(monte, data.frame(draw1, draw2))
}

colnames(monte) <- c("return","threshold")

# Share of simulated pairs where the return exceeds the household's threshold
adoption <- monte %>%
  mutate(total = n()) %>%
  filter(return > threshold) %>%
  summarize(count = n(),
            total=mean(total)) %>%
  mutate(adoption = count/total)

This could work, but I am wondering if there is an alternate way to combine these vectors into a joint probability distribution using R. I would like to be able to generate summary statistics (e.g. proportion of households that would achieve a net return greater than their required threshold), and also visualize the joint distribution in 2-dimensional space.

SolarSon
    I'm not sure I follow. You have two marginal distributions that are not matched, and you want to find the joint distribution? You can't do that unless you either know their joint distribution a priori or you have a dataset with matched data, in which case you have an empirical joint distribution. – Migwell Nov 04 '21 at 05:17

1 Answer


The question inherently does not make sense: if the data are not matched, you cannot recover or visualize the joint distribution.

The Monte Carlo exercise you've put together is akin to a permutation + bootstrap procedure: you are effectively simulating under the null hypothesis that there is no relationship between the two variables.

It is not possible to directly calculate a "joint distribution" from unmatched data. The best you can do is simulate draws under that null hypothesis of independence and conduct inference on the result, e.g. test whether the adoption proportion is larger than, say, 0.5. That is, unless you are willing to go Bayesian.
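
For example, here is a minimal sketch of that kind of inference on the simulated pairs, assuming the monte data frame built in your question (the 2000 bootstrap resamples and the 95% interval are arbitrary illustrative choices):

# Estimated adoption proportion: share of simulated pairs where return > threshold
p_hat <- mean(monte$return > monte$threshold)
p_hat

# Rough bootstrap interval for that proportion (base R, 2000 resamples)
boot_p <- replicate(2000, {
  idx <- sample(nrow(monte), replace = TRUE)
  mean(monte$return[idx] > monte$threshold[idx])
})
quantile(boot_p, c(0.025, 0.975))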

If you wish to visualize the null distribution (or any joint distribution in general), a scatter or contour plot as usual would work.

# Contour plot of the 2D kernel density of the simulated pairs
monte |>
  ggplot() +
  geom_density_2d(aes(x = return, y = threshold))

# Scatter plot of the same simulated pairs
monte |>
  ggplot() +
  geom_point(aes(x = return, y = threshold))
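
As a side note, if you are willing to lean on that same independence assumption, the proportion itself can be computed over all v1 x v2 pairs instead of by random pairing. This is just a sketch using the single-column data frames from your question:

# Compare every simulated return against every threshold (100 x 30 logical matrix)
returns    <- v1[[1]]
thresholds <- v2[[1]]

# Proportion of (return, threshold) pairs where the return clears the threshold;
# under independence this is what the random pairing above approximates
mean(outer(returns, thresholds, ">"))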
zeyuz
  • I am curious to know what you are thinking when you say 'unless you are willing to go Bayesian'? Could you elaborate? – SolarSon Nov 04 '21 at 20:27
  • A Bayesian approach would provide you a posterior joint distribution, from which you can calculate summary stats etc. However, this is not recommended in this case because you need to specify *how* the two marginal distributions are related to one another, not to mention the priors, **which is precisely what you do not know.** I.e., specify the conditional distribution of returns | threshold and the conditional distribution of threshold | returns, then sample from each in turn iteratively. – zeyuz Nov 05 '21 at 01:16