I am new to Python, so I apologize in advance if my code feels clumsy. For the past few weeks I have been working on a difficult statistical challenge called deductible modeling. I will try to avoid unnecessary statistical jargon and keep the problem simple, since, as I understand it, my issue is a programming/optimization one, not a statistical one.
Please move this thread to a more appropriate Stack Exchange site if you think it belongs elsewhere.
I have 8 parameters, 2 of which are constrained and should be positive (`phi_freq` and `phi_sev`). I am trying to maximize a log-likelihood function which is basically a non-convex, multi-modal, continuous, real-valued, not differentiable (AFAIK) function of all those parameters. Yikes!
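In case the structure helps, this is, as I understand it, the model the code below implements (the means are log-linear functions of the covariates): the frequency component is a negative binomial pmf with mean $\mu$ and dispersion $\phi$, and the severity component is a gamma density left-truncated at the deductible $d$:

$$f_{\text{freq}}(x \mid \mu, \phi) = \frac{\Gamma(\phi + x)}{\Gamma(\phi)\, x!} \left(\frac{\phi}{\phi + \mu}\right)^{\phi} \left(\frac{\mu}{\phi + \mu}\right)^{x}$$

$$f_{\text{sev}}(x \mid \mu, \phi, d) = \frac{g(x \mid \mu, \phi)}{1 - G(d \mid \mu, \phi)}, \qquad x > d$$

where $g$ and $G$ denote the pdf and cdf of a gamma distribution with shape $\phi$ and mean $\mu$.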
I know that for such problems the seed values provided to the search algorithm have a tremendous impact on which local/global optimum we converge to, but as far as I can tell my starting values are solid: even though they are hardcoded in the sample below, they are the result of an auxiliary optimization rather than subjective hunches.
I have tried the `Nelder-Mead` and `SLSQP` methods (the `SLSQP` call is commented out) of the `scipy.optimize.minimize` function, but the values returned are awkward and nonsensical; a reparameterization I could try instead of explicit constraints is sketched after the MWE below.
The following is an MWE:
```python
import numpy as np
import pandas as pd
import scipy as sp
import scipy.optimize
import scipy.special
import scipy.stats

# Expected (negative binomial) claim frequency, log-linear in the covariates.
def freq_mean(intercept_freq, beta_veh_age_log, beta_dr_section_b):
    return np.exp(intercept_freq + beta_veh_age_log * freq_data['veh_age_log'] + beta_dr_section_b * freq_data['dr_section_b'])

# Expected (gamma) claim severity, log-linear in the covariates.
def sev_mean(intercept_sev, alpha_veh_age_log, alpha_acv_log, df):
    return np.exp(intercept_sev + alpha_veh_age_log * df['veh_age_log'] + alpha_acv_log * df['acv_log'])

# Negative binomial pmf parameterized by mean mu and dispersion phi.
def freq_dist(x, mu, phi):
    return (sp.special.gamma(phi + x) / sp.special.gamma(phi) / sp.special.factorial(x)) * ((phi / (phi + mu)) ** phi) * ((mu / (phi + mu)) ** x)

# Gamma density left-truncated at the deductible: the pdf divided by the
# survival function evaluated at the deductible.
def sev_dist(x, mu, phi, ded):
    gamma_pdf = (((phi / mu) ** phi) / sp.special.gamma(phi)) * (x ** (phi - 1.0)) * np.exp(-phi * x / mu)
    gamma_sdf = 1.0 - sp.stats.gamma.cdf(x = phi * ded / mu, a = phi, scale = 1.0)
    if gamma_sdf == 0.0:
        return 0.0
    else:
        return gamma_pdf / gamma_sdf

# Severity log-likelihood over all observed (above-deductible) claims.
def sev_loglik(intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev):
    sev_data['mu_sev'] = sev_mean(intercept_sev, alpha_veh_age_log, alpha_acv_log, sev_data)
    return sev_data.apply(lambda x : np.log(sev_dist(x['claim_amount'], x['mu_sev'], phi_sev, x['deductible'])), axis = 1).sum()

# Frequency log-likelihood; claim_prob thins the frequency mean to account
# for below-deductible claims never being reported.
def freq_loglik(intercept_freq, beta_veh_age_log, beta_dr_section_b, phi_freq, intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev):
    freq_data['mu_sev'] = sev_mean(intercept_sev, alpha_veh_age_log, alpha_acv_log, freq_data)
    freq_data['claim_prob'] = 1.0 - sp.stats.gamma.cdf(x = phi_sev * freq_data['deductible'] / freq_data['mu_sev'], a = phi_sev, scale = 1.0)
    freq_data['mu_freq'] = freq_mean(intercept_freq, beta_veh_age_log, beta_dr_section_b)
    return freq_data.apply(lambda x : np.log(freq_dist(x['claim_count'], x['exposure'] * x['claim_prob'] * x['mu_freq'], x['exposure'] * phi_freq)), axis = 1).sum()

# Negative joint log-likelihood, the objective to minimize.
def obj_func(param):
    intercept_freq, beta_veh_age_log, beta_dr_section_b, phi_freq, intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev = param
    ll_sev = sev_loglik(intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev)
    ll_freq = freq_loglik(intercept_freq, beta_veh_age_log, beta_dr_section_b, phi_freq, intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev)
    return -(ll_freq + ll_sev)

def mle(nfev = 100):
    # Seed values obtained from an auxiliary method-of-moments fit.
    intercept_freq, beta_veh_age_log, beta_dr_section_b, phi_freq, intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev = (-1.87515443, -0.5675389200, -0.0216802900, 23.78568667, 10.42040743, 0.00465891, 0.00072216, 0.69497075)
    seeds = np.array([intercept_freq, beta_veh_age_log, beta_dr_section_b, phi_freq, intercept_sev, alpha_veh_age_log, alpha_acv_log, phi_sev])
    return sp.optimize.minimize(fun = obj_func, x0 = seeds, method = 'Nelder-Mead', options = {'maxfev' : nfev})
    # Alternative: SLSQP with positivity constraints on phi_freq and phi_sev.
    #cons = ({'type': 'ineq', 'fun' : lambda x : x[3]}, {'type': 'ineq', 'fun' : lambda x : x[7]})
    #return sp.optimize.minimize(fun = obj_func, x0 = seeds, method = 'SLSQP', constraints = cons)

policies = pd.read_csv('C:/policies.txt', sep = '\t')
claims = pd.read_csv('C:/claims.txt', sep = '\t')
# One row per policy, with the number of claims filed (zero if none).
freq_data = pd.merge(left = policies, right = claims.groupby('ID').agg(claim_count = ('claim_amount', 'count')), how = 'left', on = 'ID')
freq_data['claim_count'] = freq_data['claim_count'].fillna(0)
# One row per claim, joined back to the policy covariates.
sev_data = pd.merge(left = claims[['ID', 'claim_amount']], right = policies.drop(['dr_section_b', 'exposure'], axis = 1), how = 'left', on = 'ID')
opt = mle(4000)
```
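If it is relevant, here is a minimal sketch of the reparameterization mentioned above, which I could try instead of the explicit `SLSQP` constraints: let the optimizer work on `log(phi_freq)` and `log(phi_sev)`, since `exp()` of anything is positive (`obj_func` is the objective from the MWE, and the seed vector repeats the hardcoded values from `mle`):

```python
import numpy as np
import scipy.optimize

# Hypothetical wrapper around the MWE's obj_func: the optimizer sees the two
# dispersion parameters on the log scale, so every candidate it proposes maps
# back to strictly positive phi_freq and phi_sev.
def obj_func_logphi(param):
    param = np.asarray(param, dtype = float).copy()
    param[3] = np.exp(param[3])  # phi_freq
    param[7] = np.exp(param[7])  # phi_sev
    return obj_func(param)

# Same starting values as in mle(), with the two phis moved to the log scale.
seeds_logphi = np.array([-1.87515443, -0.5675389200, -0.0216802900, 23.78568667,
                         10.42040743, 0.00465891, 0.00072216, 0.69497075])
seeds_logphi[3] = np.log(seeds_logphi[3])
seeds_logphi[7] = np.log(seeds_logphi[7])
opt_logphi = scipy.optimize.minimize(fun = obj_func_logphi, x0 = seeds_logphi,
                                     method = 'Nelder-Mead', options = {'maxfev' : 4000})
```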
I am unsure whether providing data (the `policies` and `claims` flat files referenced in the sample code) is necessary at this point. I am prepared to do so, but I would need to anonymize them first.
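In the meantime, if it helps anyone run the MWE, here is a hypothetical snippet that fabricates two flat files with the schema the code expects (column names inferred from the MWE; all values are arbitrary placeholders, not my real data; adjust the output paths to match the `read_csv` calls):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_policies, n_claims = 1000, 300

# Columns inferred from the MWE; values are placeholders, not real data.
policies = pd.DataFrame({
    'ID': np.arange(n_policies),
    'veh_age_log': np.log(rng.integers(1, 20, n_policies)),
    'dr_section_b': rng.integers(0, 2, n_policies),
    'acv_log': np.log(rng.uniform(5e3, 5e4, n_policies)),
    'deductible': rng.choice([250.0, 500.0, 1000.0], n_policies),
    'exposure': rng.uniform(0.1, 1.0, n_policies),
})
claims = pd.DataFrame({
    'ID': rng.choice(policies['ID'], n_claims),
    'claim_amount': rng.gamma(shape = 2.0, scale = 1500.0, size = n_claims),
})
policies.to_csv('policies.txt', sep = '\t', index = False)
claims.to_csv('claims.txt', sep = '\t', index = False)
```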
So, I guess at this point any pointers would be welcome, as I am close to hitting a wall. There must be a global solution, meaning the MLE (maximum likelihood estimates) must exist. My seed values are very realistic, as they were obtained through moment matching (the so-called method-of-moments estimators). I feel there must be something else I'm doing wrong. Also, as lame as that may sound, I even reproduced the exact same problem with Excel's solver and met similar numerical convergence issues. I am happy to provide supplementary details, technical or not, regarding this problem.
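For completeness, this is a small sanity check, nothing more, that I can run first: evaluating the objective at and around the seeds, to rule out the log-likelihood already being non-finite at the starting point, which would explain nonsensical steps from any optimizer (`obj_func` is the one from the MWE; the seed vector repeats the hardcoded values from `mle`):

```python
import numpy as np

# Same hardcoded seeds as in mle().
seeds = np.array([-1.87515443, -0.5675389200, -0.0216802900, 23.78568667,
                  10.42040743, 0.00465891, 0.00072216, 0.69497075])

# Is the objective finite at the seed point?
base = obj_func(seeds)
print('objective at seeds:', base, 'finite:', np.isfinite(base))

# Is it still finite under small (1%) relative perturbations?
rng = np.random.default_rng(1)
for _ in range(5):
    jitter = seeds * (1.0 + rng.normal(0.0, 0.01, size = seeds.shape))
    val = obj_func(jitter)
    print('perturbed objective:', val, 'finite:', np.isfinite(val))
```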