For an auctions website I run, I'm aiming to find which features have the highest influence on bids received. This way, I can focus my energies on improving features that matter the most.
I've been advised to run a Poisson Regression analysis for this. This question is about getting the data ready for the regression and then running it. I'm using Python.
The data: The dataset comprises auctions that lived for precisely 7 days. There is a mix of continuous and categorical features. Continuous ones are asking_price, description_char_count and num_of_photos.
Categorical variables are city, item_category and item_condition.
The dependent variable is net_unique_bids.
How do I handle the categorical variables?
Dummy variables: Correct me if I'm wrong - but I think I need to do the following:
# convert categorical columns
cities = pd.get_dummies(df['city'], drop_first=True)
categ = pd.get_dummies(df['item_category'], drop_first=True)
cond = pd.get_dummies(df['item_condition'], drop_first=True)
# add to main dataframe 'df'
df = pd.concat([df,cities,categ, cond], axis=1)
# remove original categorical columns
df.drop('city',axis=1, inplace=True)
df.drop('item_category',axis=1, inplace=True)
df.drop('item_condition',axis=1, inplace=True)
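To make the dummy-variable step concrete, here is a minimal sketch on a tiny made-up dataframe (the values are invented for illustration and are not from my real dataset). It is a variation on the steps above rather than the same code: passing columns= to pd.get_dummies encodes and removes the original columns in one call, and by default the new columns are prefixed with the source column name, so dummies from different variables cannot collide.

```python
import pandas as pd

# Toy data, invented purely for illustration (not the real auction dataset)
toy = pd.DataFrame({
    "city": ["A", "B", "C", "A"],
    "item_category": ["phones", "cars", "phones", "cars"],
    "item_condition": ["new", "used", "used", "new"],
    "asking_price": [100.0, 2500.0, 80.0, 3000.0],
    "net_unique_bids": [3, 1, 0, 2],
})

categorical_cols = ["city", "item_category", "item_condition"]

# columns= replaces each listed column with its dummy columns, named
# "<column>_<level>"; drop_first=True drops one level per variable so that
# level acts as the reference category
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

print(encoded.columns.tolist())
```

If this is equivalent to what I did above, then the names in the regression formula would have to match these generated column names rather than placeholders like city1.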
Running Poisson Regression: If this is correct so far, the next steps entail:
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import (Exchangeable,
Independence,Autoregressive)
from statsmodels.genmod.families import Poisson
f1 = "net_unique_bids ~ city1 + city2 + city3 + city4 + item_category1 + item_category2 + item_category3 + item_condition1 + item_condition2 + item_condition3 + asking_price + description_char_count + num_of_photos"
model1 = GEE.from_formula(formula=f1, data=df, cov_struct=Independence(), family=Poisson())
Do I have the right idea about how to handle categorical variables? Am I running Poisson Regression correctly (and have I formulated f1 correctly as well)?
If not, please help me fill in the gaps.
Note: I got my guidance on Poisson Regression in Python from here.