I am trying to predict the number of visitors to a website from historical data. I think this is a scenario in which I could use Poisson regression.
The input consists of 6 columns:
id (the id of the website), day, month, year, day of week, visits.
So basically as input we have a CSV with columns in the format: "2","22", "7", "2015", "6","751".
I am trying to predict the visits from the previous numbers of visits. The size of the websites varies a lot, so I ended up dividing them into 5 categories by average daily visits:
- almost zero (avg < 1)
- very small (avg < 100)
- small (avg < 1000)
- medium (avg < 50,000)
- big (avg < 500,000)
So I added a 7th column named type, which is an int ranging from 1 to 5.
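A minimal sketch of how such a type column could be derived from the per-site average, using the thresholds listed above. The toy data frame and its values are made up for illustration and are not from the real dataset; only the column names id, visits and type follow the description.

```r
# Hypothetical toy data: three sites with very different traffic levels
train <- data.frame(
  id     = rep(1:3, each = 10),
  visits = c(rep(c(0, 1), 5), rep(c(40, 60), 5), rep(c(4000, 6000), 5))
)

# Average visits per site, repeated on every row of that site
# (ave() uses mean() as its default summary function)
train$avg <- ave(train$visits, train$id)

# Bin the averages with the thresholds from the list above.
# right = FALSE makes each bin [lower, upper), so avg < 1 falls into category 1.
# labels = FALSE returns the integer bin codes 1..5 instead of a factor.
train$type <- cut(train$avg,
                  breaks = c(0, 1, 100, 1000, 50000, 500000),
                  labels = FALSE, right = FALSE)

unique(train[, c("id", "avg", "type")])
```

Note that a site averaging 500,000 or more would get NA here, because it falls outside the last break.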
My code is as follows:
train = read.csv("train.csv", header = TRUE)
model<-glm(visits ~ type + day + month + year + dayofweek, train, family=poisson)
summary(model)
P = predict(model, newdata = train)
imp = round(P)
imp
The predicted values are not even close. I thought I could end up within 10-20% of the actual values, but most of the predicted values are 200-300% bigger than the actual ones. And this is on the training data set, which should give an optimistic view.
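One detail that may matter when comparing predictions with actual values: for a glm object, predict() returns the linear predictor (the log of the expected count, for the Poisson log link) unless type = "response" is given. The sketch below illustrates the difference and one possible way to measure relative error. It runs on simulated data, so the data frame, the coefficients used to simulate it and the reduced formula are all assumptions, not the real train.csv.

```r
# Hypothetical simulated data standing in for train.csv
set.seed(42)
n <- 500
train <- data.frame(dayofweek = sample(1:7, n, replace = TRUE),
                    type      = sample(1:5, n, replace = TRUE))
train$visits <- rpois(n, lambda = exp(0.5 + 1.2 * train$type - 0.05 * train$dayofweek))

model <- glm(visits ~ type + dayofweek, data = train, family = poisson)

# Default (type = "link"): linear predictor, i.e. log of the expected count
p_link <- predict(model, newdata = train)

# type = "response": expected count on the original scale
p_resp <- predict(model, newdata = train, type = "response")

# The two are related through the log link
all.equal(exp(p_link), p_resp)

# Median absolute relative error, skipping zero-visit rows to avoid division by zero
nz <- train$visits > 0
median(abs(p_resp[nz] - train$visits[nz]) / train$visits[nz])
```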
I am new to R and having some problems interpreting the data returned by the summary command. This is what it returns:
Call:
glm(formula = visits ~ type + day + month + year + dayofweek, family = poisson, data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-571.05   -44.04   -11.33    -5.14   734.43

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)
(Intercept) -9.998e+02  6.810e-01 -1468.19   <2e-16 ***
type         2.368e+00  1.280e-04 18498.53   <2e-16 ***
day         -2.473e-04  6.273e-06   -39.42   <2e-16 ***
month        1.658e-02  3.474e-05   477.31   <2e-16 ***
year         4.963e-01  3.378e-04  1469.31   <2e-16 ***
dayofweek   -3.783e-02  2.621e-05 -1443.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1239161821  on 12370  degrees of freedom
Residual deviance:  157095033  on 12365  degrees of freedom
AIC: 157176273

Number of Fisher Scoring iterations: 5
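As a rough reading aid for the deviance lines in this output: a common rule of thumb is that, for a Poisson model that fits well, the residual deviance is of about the same size as the residual degrees of freedom. The sketch below only does that arithmetic with the two numbers printed above; the variable names are hypothetical and nothing is refitted.

```r
# Numbers copied from the summary output above
residual_deviance <- 157095033
residual_df       <- 12365

# Ratio of residual deviance to residual degrees of freedom;
# values far above 1 are usually read as a sign of overdispersion or a poor fit
ratio <- residual_deviance / residual_df
ratio
```

With a fitted model object, the same two quantities could be read with deviance(model) and df.residual(model) instead of being typed in by hand.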
Could anyone describe in more detail the values returned by the summary command, and what they should look like for a Poisson regression that gives better predictions? Are there better approaches in R for data where the value to be estimated evolves over time?