In a research problem I have real-world data (RD) and artificial data (AD) for several continuous variables. Both are approximately normally distributed (with a tail to the right) and cover similar ranges. I need to transform the artificial data so that it contains the values that are underrepresented in the real data. I.e., if we assume a perfect normal distribution, a value should be more likely to be sampled the further it lies from the mean.

My initial attempt in R was to use the probability density function of the RD as a basis. But the more I read about it, the more it seems that this is not as trivial as I thought.

My code so far looks like this:

library(dplyr)


df1 <- read.csv("_.csv")

df2 <- read.csv("_.csv")

df2_1 <- df2[c(1:5500), ]
df2_2 <- df2[c(5501:11000), ]
df2_3 <- df2[c(11001:16500), ]
df2_4 <- df2[c(16501:22000), ]

df_active <- df2_1

# Plot the probability density function (PDF) using density plot functions
plot(density(na.omit(df1$..)), main = "Probability Density Function", xlab = "Values", ylab = "Density")

plot(density(na.omit(df_active$..)), main = "Probability Density Function", xlab = "Values", ylab = "Density")


# Step 1: Estimate the probability density function (PDF) using kernel density estimation
pdf_func <- density(na.omit(df1$..))

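# Step 2: Evaluate the estimated density at each value in the second dataframe
# by interpolating the KDE grid (x, y) returned by density()
pdf_probabilities <- approx(pdf_func$x, pdf_func$y, xout = df_active$.., rule = 2)$y
# Missing values get zero weight so that sample() does not fail on NA probabilities
pdf_probabilities[is.na(pdf_probabilities)] <- 0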

sum(pdf_probabilities)

# Step 3: Use probabilities to determine which values to remove from the second dataframe
sampled_indices <- sample(nrow(df_active), size = sum(pdf_probabilities), replace = TRUE, prob = pdf_probabilities)

# Step 4: Create a new dataframe containing only the sampled rows
new_df2_active <- df_active[sampled_indices, ]

# Step 5: Remove the sampled rows from the active batch of df2
df3 <- anti_join(df_active, new_df2_active, by = "plot")

df2 is divided into batches of 5500 each in order to match the size of df1.

Result so far: the sampled values look OK, but if I remove them from df2, its distribution stays more or less the same.
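
To make the goal concrete, the intended weighting could be sketched roughly as follows. This is only an illustration on synthetic stand-in data, not a verified solution: the gamma-distributed vectors, the column name `value`, the constant `eps`, and the subsample size `n_keep` are all assumptions made for the example. The idea is to weight each artificial row by the inverse of the real-data density, so values that are rare in the real data are more likely to be drawn.

```r
set.seed(42)

# Hypothetical stand-ins for the real (RD) and artificial (AD) data;
# the column name `value` and the gamma parameters are assumptions.
rd <- data.frame(value = rgamma(5500, shape = 20, rate = 2))
ad <- data.frame(plot = seq_len(5500),
                 value = rgamma(5500, shape = 20, rate = 2))

# Kernel density estimate of the real data (returned on a grid of x/y points)
kde <- density(na.omit(rd$value))

# Interpolate the KDE at each artificial value; rule = 2 reuses the edge
# density for values that fall outside the grid
dens_at_ad <- approx(kde$x, kde$y, xout = ad$value, rule = 2)$y

# Weight each artificial row inversely to the real-data density
eps <- 1e-6  # assumed small constant to avoid dividing by values near zero
weights <- 1 / (dens_at_ad + eps)

# Draw a subsample without replacement so that no row is selected twice
n_keep <- 1000  # assumed subsample size
keep_idx <- sample(nrow(ad), size = n_keep, replace = FALSE, prob = weights)
ad_rare <- ad[keep_idx, ]

# Compare the spread of the full artificial data and the subsample
sd(ad$value)
sd(ad_rare$value)
```

Sampling without replacement and keeping the drawn rows (instead of sampling with replacement and removing them afterwards) avoids duplicate indices, so the size of the resulting subset is exactly `n_keep`.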

I would be grateful for any help on how to solve this problem.
