I'm conducting a case study in which I have to predict the number of claims per policy. Since my target variable ClaimNb is a count rather than binary, I can't use logistic regression and have to use Poisson regression instead. My code for the GLM:


    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    formula = 'ClaimNb ~ BonusMalus+VehAge+Freq+VehGas+Exposure+VehPower+Density+DrivAge'

    model = smf.glm(formula=formula, data=df,
                    family=sm.families.Poisson())

I have also split my data:

    from sklearn.model_selection import train_test_split

    # train-test split
    train, test = train_test_split(df, test_size=0.2, random_state=0)

    # separate the target and the independent variables
    train_x = train.drop(columns=['ClaimNb'])
    train_y = train['ClaimNb']

    test_x = test.drop(columns=['ClaimNb'])
    test_y = test['ClaimNb']

My problem now is the prediction. I tried the following, but it did not work:

    from sklearn.linear_model import PoissonRegressor

    model = PoissonRegressor(alpha=1e-3, max_iter=1000)

    model.fit(train_x, train_y)

    predict = model.predict(test_x)

Is there another way to make predictions and check the accuracy of the model?

Thanks

Patrik
L200

1 Answer

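With statsmodels you need to assign the result of model.fit() to a variable and call predict on that result; this differs from sklearn, where the estimator is fitted in place. Also, if you are using the formula interface, it is better to split the whole dataframe into train and test sets and pass those dataframes directly to the model and to predict. For example: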

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Toy data: 50 rows of random integers standing in for the real policy table
df = pd.DataFrame(np.random.randint(0,100,(50,4)),columns=['ClaimNb','BonusMalus','VehAge','Freq'])

# 80/20 split that keeps the target and the predictors in the same dataframe
df_train = df.sample(round(len(df)*0.8))
df_test = df.drop(df_train.index)

formula= 'ClaimNb ~ BonusMalus+VehAge+Freq'

# Fit on the training rows only and keep the fitted result for prediction
model = smf.glm(formula = formula, data=df_train,family=sm.families.Poisson())
result = model.fit()

We can then predict on the training data:

result.predict(df_train)

Or on the held-out test data:

result.predict(df_test)
StupidWolf
  • Thank you. So if I understand correctly, it's not like R, where you can get a new column with the estimated claims (claim_estimated)? – L200 Nov 28 '20 at 19:30
  • You can add it to your test dataframe, e.g. `df_test['pred'] = result.predict(df_test)` – StupidWolf Nov 28 '20 at 19:35
  • And how do I group it by policy? – L200 Nov 29 '20 at 13:41
  • I am not sure what you mean. It seems like a new question; can you post it separately? – StupidWolf Nov 29 '20 at 13:49
  • Actually it's the same question: as per my title, I am predicting the number of claims per policy. Once we have the predicted values and add them to the dataframe, they should be grouped by policy, for example policy 1: claimnb = 2, predictedclaim = 1 – L200 Nov 29 '20 at 14:10