I would like to ask for help with finding an optimization procedure. I have to explain a few steps of my work first, and I hope I can state the problem clearly. I am currently doing research on missing data in R, working with both simulated and real data. In the simulated data I have constructed missing-data patterns similar to the real data using some learning techniques. What I now want to do is transfer what I learn from the simulation to the real data.
In the simulation I have a true parameter vector: the parameters of a regression model estimated on the full data without missing values. Let us call this vector REF. After artificially generating missing values and multiply imputing them (e.g. with MICE), I obtain the parameters of the regression model based on the imputed data. Let us call these parameters MICE. I now want to model the difference between REF and MICE, call it PA-REF, and transfer what I learn to the real data. PA-REF can be modeled fairly well with an extreme value distribution (EVD): MICE is usually close to REF, but in some cases it is not. Thus, to construct a distribution for the estimated regression parameters in +++ the real data (!) with missings and plugged-in imputations via MICE +++, I get a very broad distribution with the extreme values around the regression parameters. For this I estimate a logit model with some covariates, which yields the probability that a regression parameter is an outlier (in terms of the difference between REF and MICE in the simulation). I then try to govern the EVD parameters with these probabilities using approxfun.
In the code below you find an implementation of this idea.
At the beginning there is a very basic function for the dependency of the EVD parameters on the outlier probability: simply a linear function with a start and an end value. My starting point is a split between low and high values of PA-REF. Into each group I plugged an extra value: for the low-value group the outlier cutoff (10), for the high-value group the lowest value of the low-value group. I do this to account for the possibility that the estimated probabilities only imperfectly reflect whether a MICE-based regression parameter is an outlier or not. In the simulation the model for predicting whether a parameter is an outlier (high PA-REF) works fairly well, but it has some shortcomings that are not necessary to explain here.
In the next step I use this function to draw values from the EVD governed in this way, i.e. 125 values for PA-REF, with the simple linear function giving the EVD parameters as a function of the outlier probability. The draws must be greater than 0 and no higher than 1.5 times the maximum of PA-REF. The governed EVD draws yield a distribution for every regression parameter; here I obtain a 125 x 44 matrix, whose rows can be doubled (mirrored) because the direction/sign of PA-REF is unknown. I then apply approxfun to this EVD distribution again, which lets me calculate the density value for the observed PA-REF (the difference between REF and MICE) in the simulation; see the next step.
The third step does basically this for every element of y.pa.ref: approximate the density of the EVD draws and integrate this approximated function up to the observed y.pa.ref, subject to some constraints (start.val, end.val, max.val), to get the probability mass up to the observed PA-REF. Finally, the mean of these masses measures how well the EVD covers the PA-REF values on average. Here it is 0.33, meaning that on average this is the probability that the y.pa.ref values are covered by the extreme value distribution specified for each parameter of the regression model. (If you take the direction into account, the value could be doubled.)
What I want to optimize is the following: I want to maximize the mean of these probability masses over all elements of the observed PA-REF vector. So my question is: is it possible to optimize this value by finding optimal functions for the parameters of the EVD?
I tried several ideas on my own. For example, I took the pairs with the highest prob.outl and y.pa.ref, drew values from a multivariate normal around them, and used these draws to obtain higher estimates of the EVD parameters. Then I built a function with a linear increase of the EVD parameters for outl.prob between 0 and 0.5, and constant EVD parameters (the higher MVN-based ones) for outl.prob between 0.5 and 1. Unfortunately, this does not improve the mean probability mass at all. I believe that this setup can be solved via optimization, but maybe I am wrong...? :-))
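To make this attempt concrete, here is a minimal sketch of what I did (the subset size, draw count, and variable names are made up for illustration; mvrnorm() is from MASS, and the refit via fevd() is only indicated, not run):

```r
## Sketch of the MVN-based attempt; illustrative only.
library(MASS)  # for mvrnorm()

## a few (outlier probability, PA-REF) pairs taken from the data below
prob.sub <- c(0.89814, 1, 1, 0.63696, 0.25723, 0.15371)
pa.sub   <- c(24.32432, 13.58024, 4.76189, 18.18181, 1.17646, 0.43997)

## keep the pairs with the highest outlier probabilities
top       <- order(prob.sub, decreasing = TRUE)[1:4]
pairs.top <- cbind(prob.sub[top], pa.sub[top])

## draw perturbed pairs from a bivariate normal around these pairs
set.seed(1)
mvn.draws <- mvrnorm(n = 200, mu = colMeans(pairs.top), Sigma = cov(pairs.top))

## admissible PA-REF draws; these would then be passed to fevd() to get
## the "higher" EVD parameters used for outlier probabilities above 0.5
mvn.pa <- mvn.draws[, 2]
mvn.pa <- mvn.pa[mvn.pa > 0]
```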
Thanks for your support. If you are willing, I will credit your help in my publication, or at least anonymously!
PA.REF <- c(0.43997, 1.99193, 2.36845, 1.37931, 24.32432, 0.42373, 2.77907,
2.49998, 18.18181, 0.52219, 0.37312, 5.24694, 4.21052, 0.59523,
5.02122, 4.78839, 3.11992, 4.42476, 16.09194, 3.48839, 3.24674,
0.39386, 3.74449, 6.20155, 2.02429, 2.02953, 2.77776, 2.22222,
7.34693, 5.15163, 1.57566, 0.16821, 3.20001, 1.17646, 0.83331,
0.55556, 1.88996, 0.38461, 4.76189, 0.91186, 1.0118, 0.29985,
1.34529, 13.58024)
Outl.prob <- c(0.15371, 0.02757, 0.01265, 0.05866, 0.89814, 0.01828, 0.00442,
0.03018, 0.63696, 0.00845, 0.01356, 0.00507, 0.07715, 0.07176,
0.02763, 0.00671, 0.00998, 0.03585, 0.01629, 0.16794, 0.09089,
0.00406, 0.01003, 0.03254, 0.01134, 0.00711, 0.00424, 0.00589,
0.09664, 0.01311, 0.00851, 0.01428, 0.01049, 0.25723, 0.0257,
0.06623, 0.00728, 0.00789, 1, 0.00587, 0.01292, 0.00889, 0.0179,
1)
my.data <- data.frame(Outl.prob, PA.REF)
y.pa.ref <- my.data[,c("PA.REF")]
outl.prob <- my.data[,c("Outl.prob")]
cut.outl <- 10
require(extRemes)
###########################
#### Step 1
## Fit an EVD to the low half of PA-REF (plus the outlier cutoff) and to
## the high half (plus the lowest value of the low half).
n.half <- length(y.pa.ref) / 2
fit.evd.min <- fevd(c(sort(y.pa.ref)[1:n.half], cut.outl))
loc.evd.fit.min   <- fit.evd.min$results$par["location"]
scale.evd.fit.min <- fit.evd.min$results$par["scale"]
shape.evd.fit.min <- fit.evd.min$results$par["shape"]
## start at n.half + 1 so the midpoint is not used in both halves
fit.evd.max <- fevd(c(sort(y.pa.ref)[(n.half + 1):length(y.pa.ref)],
                      min(y.pa.ref)))
loc.evd.fit.max   <- fit.evd.max$results$par["location"]
scale.evd.fit.max <- fit.evd.max$results$par["scale"]
shape.evd.fit.max <- fit.evd.max$results$par["shape"]
## Linear interpolation of each EVD parameter between prob = 0 and prob = 1;
## approxfun() on the two endpoints is equivalent to the former
## approxfun(approx(..., n = 1e4)) construction.
basic.loc.evd.fun   <- approxfun(c(0, 1), c(loc.evd.fit.min,   loc.evd.fit.max))
basic.scale.evd.fun <- approxfun(c(0, 1), c(scale.evd.fit.min, scale.evd.fit.max))
basic.shape.evd.fun <- approxfun(c(0, 1), c(shape.evd.fit.min, shape.evd.fit.max))
###########################
###########################
#### Step 2
## For each observation, draw 125 EVD values (governed by its outlier
## probability) that fall inside (0, 1.5 * max(PA-REF)).
n.draws     <- 125
evd.val.max <- max(y.pa.ref) * 1.5
matrix.evd.v4 <- matrix(nrow = n.draws, ncol = length(y.pa.ref))
set.seed(71108)
for (i in seq_along(y.pa.ref)) {
  draws <- numeric(0)
  while (length(draws) < n.draws) {
    ## revd() is the GEV sampler in extRemes (rgev() belongs to the evd package)
    x <- revd(1, loc   = basic.loc.evd.fun(outl.prob[i]),
                 scale = basic.scale.evd.fun(outl.prob[i]),
                 shape = basic.shape.evd.fun(outl.prob[i]))
    if (x > 0 & x < evd.val.max) draws <- c(draws, x)
  }
  matrix.evd.v4[, i] <- draws
}
## Mirror the draws because the sign (direction) of PA-REF is unknown
EVD2 <- rbind(matrix.evd.v4, -matrix.evd.v4)
###########################
###########################
#### Step 3
## For each observation: approximate the density of the EVD draws and
## integrate it from max(0, lower support bound) up to the observed
## PA-REF value, giving the probability mass below the observed value.
fun.list.EVD2     <- list()
est.val.fun.EVD2  <- list()
est.prob.fun.EVD2 <- numeric(0)
for (i in seq_along(y.pa.ref)) {
  dens <- density(EVD2[, i])
  fun.list.EVD2[[i]] <- approxfun(dens)
  start.val <- max(min(dens$x), 0)  # integrate from 0 if the support dips below it
  end.val   <- y.pa.ref[i]
  max.val   <- max(dens$x)
  if (start.val < end.val & end.val < max.val) {
    prob.EVD2fun <- integrate(fun.list.EVD2[[i]], start.val, end.val)
    est.val.fun.EVD2[[i]] <- prob.EVD2fun
    est.prob.fun.EVD2     <- c(est.prob.fun.EVD2, prob.EVD2fun$value)
  } else {
    ## observed value outside the approximated support: count as zero mass;
    ## [i] <- list(NULL) stores NULL without deleting the list element
    est.val.fun.EVD2[i] <- list(NULL)
    est.prob.fun.EVD2   <- c(est.prob.fun.EVD2, 0)
  }
}
###########################
optim.val.prob.EVD2 <- mean(est.prob.fun.EVD2)  # mean coverage probability (~0.33): the value I want to maximize
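To show the direction I have in mind, here is a minimal self-contained sketch of such an optimization. Everything in it is illustrative, not my actual method: it replaces the Monte Carlo draws plus approxfun/integrate machinery with the analytic GEV density, uses the mean log-density as the objective instead of the mean probability mass, takes only the first nine observations, and the function and variable names (gev_pdf, par_fun, neg_obj) are made up:

```r
## Sketch: treat the six endpoints of the linear EVD parameter functions
## as free parameters and let optim() maximize the fit directly.
pa   <- c(0.43997, 1.99193, 2.36845, 1.37931, 24.32432,
          0.42373, 2.77907, 2.49998, 18.18181)
prob <- c(0.15371, 0.02757, 0.01265, 0.05866, 0.89814,
          0.01828, 0.00442, 0.03018, 0.63696)

## GEV density for shape != 0, vectorized; 0 outside the support
gev_pdf <- function(x, loc, scale, shape) {
  z <- 1 + shape * (x - loc) / scale
  ifelse(z > 0 & scale > 0,
         z^(-1/shape - 1) * exp(-z^(-1/shape)) / scale, 0)
}

## each EVD parameter is linear in the outlier probability
par_fun <- function(p, at0, at1) at0 + p * (at1 - at0)

## objective: negative mean log-density of the observed PA-REF values
neg_obj <- function(theta) {
  d <- gev_pdf(pa,
               loc   = par_fun(prob, theta[1], theta[2]),
               scale = par_fun(prob, theta[3], theta[4]),
               shape = par_fun(prob, theta[5], theta[6]))
  if (any(!is.finite(d)) || any(d <= 0)) return(1e10)  # penalize invalid regions
  -mean(log(d))
}

start <- c(1, 10, 1, 5, 0.2, 0.2)  # rough endpoint guesses at prob = 0 and prob = 1
fit   <- optim(start, neg_obj, method = "Nelder-Mead",
               control = list(maxit = 5000))
fit$par  # optimized endpoints of the three parameter functions
```

With the full pipeline, neg_obj would instead rerun Steps 2 and 3 for a candidate theta and return minus the mean coverage probability; that objective is noisy because of the simulation, so fixing the random seed inside the objective (or using the analytic density as above) would be needed for a stable optimization.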