I am trying to make a prediction on the number of visitors of a website based on historic data collected. I think this is a scenario in which I could use Poisson Regression.

The input consists of 6 columns:

id(the id of the website), day, month, year, day of week, visits.

So basically as input we have a CSV with columns in the format: "2","22", "7", "2015", "6","751".

I am trying to predict the visits based on the previous number of visits. The size of the websites can vary, so I ended up dividing them into 5 categories:

  • almost zero (avg < 1)
  • very small (avg < 100)
  • small (avg < 1,000)
  • medium (avg < 50,000)
  • big (avg < 500,000)

So I made a 7th column named type, which is an int ranging from 1 to 5.
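
For reference, a column like this could be derived roughly as follows. This is only a sketch on made-up numbers, not the code used to build train.csv; the `avg` helper column and the example values are assumptions for illustration.

```r
# Hypothetical data: three sites with very different traffic levels
train <- data.frame(
  id     = rep(c(1, 2, 3), each = 4),
  visits = c(0, 1, 0, 0,   40, 60, 55, 45,   20000, 30000, 25000, 35000)
)

# Per-site average, repeated on every row belonging to that site
train$avg <- ave(train$visits, train$id)

# Bucket the average into the five size categories.
# right = FALSE makes the intervals left-closed, i.e. "avg < upper bound".
train$type <- cut(train$avg,
                  breaks = c(-Inf, 1, 100, 1000, 50000, 500000),
                  right = FALSE, labels = FALSE)

print(unique(train[, c("id", "avg", "type")]))
```

Sites averaging 500,000 or more would get NA here, since the list above has no category for them.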

My code is as follows:

train = read.csv("train.csv", header = TRUE)
model<-glm(visits ~ type + day + month + year + dayofweek, train, family=poisson)
summary(model)
P = predict(model, newdata = train)
imp = round(P)
imp

The predicted values are not even close. I thought I could end up within 10-20% of the actual values, but most of the predicted values are 200-300% bigger than the actual ones. And this is on the training data set, which should give an optimistic view.

I am new to R and having some problems interpreting the data returned by the summary command. This is what it returns:

Call:
glm(formula = visits ~ type + day + month + year + dayofweek, family = poisson, data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-571.05   -44.04   -11.33    -5.14   734.43

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)
(Intercept) -9.998e+02  6.810e-01 -1468.19   <2e-16 ***
type         2.368e+00  1.280e-04 18498.53   <2e-16 ***
day         -2.473e-04  6.273e-06   -39.42   <2e-16 ***
month        1.658e-02  3.474e-05   477.31   <2e-16 ***
year         4.963e-01  3.378e-04  1469.31   <2e-16 ***
dayofweek   -3.783e-02  2.621e-05 -1443.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1239161821  on 12370  degrees of freedom
Residual deviance:  157095033  on 12365  degrees of freedom
AIC: 157176273

Number of Fisher Scoring iterations: 5

Could anyone describe in more detail the values returned by the summary command, and what they should look like for a Poisson regression that gives better predictions? Are there better approaches in R for data where the value to be estimated evolves over time?

Later edit: link to the train.csv file.

Dragos Geornoiu
  • Without your data it is basically impossible to help you. – adaien Apr 10 '16 at 12:10
  • A model with day, month and year could have some interesting behaviour. Think about what happens at the transition from one month to the next. Are you sure you want this? – Richard Telford Apr 10 '16 at 12:21
  • @adaien I've added the train.csv file. – Dragos Geornoiu Apr 10 '16 at 12:48
  • @RichardTelford I thought that this would be a good way to represent the date value. I think the date should have a big impact on the predicted values, since the prediction has to be based on the historical data collected. I would be grateful if you have a better suggestion on how to do it :) – Dragos Geornoiu Apr 10 '16 at 12:53
  • They are not numerical variables. They don't have properties like December = January * 12. They have numbers because you have coded them but they don't mean anything. I'd suggest researching basic variable types and how to add categorical variables to a regression model. – ayhan Apr 10 '16 at 13:08
  • @ayhan Thank you very much, I will do more research. I looked at how to represent a date in a regression model and found an example which does something similar to what I was doing, but I understand what you mean. As a first thought, would it be a better solution to represent the first date as 1 and iterate from there? The 50th day would be 50 and so on, so that each day would be the previous one + 1? – Dragos Geornoiu Apr 10 '16 at 13:23
  • That would be your time index, and yes, you can use it. But if you would also like to incorporate seasonality, then you'll need dummy variables for each month (or week, or day of the week, based on your expectations and exploratory analysis). A simple example is here: https://www.otexts.org/fpp/5/2 – ayhan Apr 10 '16 at 13:30

1 Answer

Your problem is with the predict command. The default in predict.glm is to make predictions on the link scale, which for the Poisson family is the log of the expected count. If you want predictions that you can directly compare with the original data, you need to use the argument type = "response":

P <- predict(model, newdata = train, type = "response")
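
To see the difference between the two scales, here is a minimal sketch on simulated data (the data frame `d` and its values are made up for illustration, not taken from train.csv). With the log link, exponentiating the link-scale prediction should reproduce the response-scale prediction.

```r
set.seed(1)
# Simulated stand-in for the real data (hypothetical values)
d <- data.frame(dayofweek = rep(1:7, times = 20))
d$visits <- rpois(nrow(d), lambda = exp(5 - 0.05 * d$dayofweek))

m <- glm(visits ~ dayofweek, data = d, family = poisson)

p_link <- predict(m, newdata = d)                     # default: log of expected count
p_resp <- predict(m, newdata = d, type = "response")  # expected count itself

# The two agree once the log link is inverted
same_scale <- isTRUE(all.equal(exp(p_link), p_resp))
print(same_scale)
```

Rounding `p_link` directly, as in the question, rounds log-counts rather than counts, so those numbers cannot be compared with the observed visits.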

The model setup is not ideal. Month should perhaps be included as a categorical variable (as.factor), and you need to think more about day (day 31 of one month is followed by day 1 of the next). The predictor "type" is also dubious, as it is derived directly from the response.
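
One possible way to set this up, sketched on simulated data: a running time index (as suggested in the comments) replaces the raw day and year columns, and month and day of week enter as factors. The column names `t`, `month` and `dayofweek` and the simulated counts are assumptions for illustration only.

```r
set.seed(2)
# Hypothetical daily data for a single site over two years
dates <- seq(as.Date("2014-01-01"), as.Date("2015-12-31"), by = "day")
d <- data.frame(date = dates)

lt <- as.POSIXlt(d$date)
d$t         <- as.integer(d$date - min(d$date)) + 1  # time index: 1, 2, 3, ...
d$month     <- factor(lt$mon + 1)                    # 12 unordered levels
d$dayofweek <- factor(lt$wday)                       # 0 = Sunday ... 6 = Saturday

# Simulated counts with a weekend bump (made-up effect size)
d$visits <- rpois(nrow(d), lambda = 200 + 50 * (d$dayofweek %in% c("0", "6")))

m <- glm(visits ~ t + month + dayofweek, data = d, family = poisson)

# Intercept + t + 11 month dummies + 6 weekday dummies
n_coef <- length(coef(m))
print(n_coef)
```

Because month and dayofweek are factors, each level gets its own coefficient relative to a baseline level, so the model no longer assumes that December is "12 times" January.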

Your model is also highly over-dispersed. This might indicate missing predictors or other problems.
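
A rough check is the ratio of residual deviance to residual degrees of freedom, which should be near 1 for a well-fitting Poisson model. The first part of the sketch below uses the numbers from the summary output in the question. The second part fits a quasi-Poisson model to simulated over-dispersed counts; that is one common way to handle over-dispersion, offered here as an assumption rather than a recommendation specific to this data set.

```r
# Values copied from the summary output in the question
resid_dev <- 157095033
resid_df  <- 12365
ratio <- resid_dev / resid_df   # far above 1 suggests over-dispersion
print(ratio)

# Simulated over-dispersed counts (hypothetical data)
set.seed(3)
d <- data.frame(x = runif(500))
d$y <- rnbinom(nrow(d), mu = exp(3 + d$x), size = 0.5)

m_pois  <- glm(y ~ x, data = d, family = poisson)
m_quasi <- glm(y ~ x, data = d, family = quasipoisson)

# Same coefficient estimates; quasi-Poisson estimates the dispersion
# instead of fixing it at 1, which widens the standard errors
disp <- summary(m_quasi)$dispersion
print(disp)
```

For the model in the question the ratio is in the thousands, which is why every coefficient appears overwhelmingly significant.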

You should also think about using a mixed effect model.
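
A minimal sketch of what that could look like, assuming the lme4 package is installed (it is not part of base R). A random intercept per website lets each site have its own baseline traffic, which could stand in for the hand-made "type" categories. The simulated data and the `weekend` column are hypothetical.

```r
set.seed(4)
# Hypothetical data: 20 sites, 60 days each, with site-specific baselines
d <- expand.grid(id = factor(1:20), t = 1:60)
site_effect <- rnorm(20, mean = 0, sd = 1)
d$weekend <- as.integer((d$t %% 7) %in% c(0, 6))
d$visits <- rpois(nrow(d),
                  lambda = exp(3 + site_effect[as.integer(d$id)] - 0.3 * d$weekend))

fitted_ok <- FALSE
if (requireNamespace("lme4", quietly = TRUE)) {
  # Poisson GLMM with a random intercept for each site
  mm <- lme4::glmer(visits ~ weekend + (1 | id), data = d, family = poisson)
  # One estimated random intercept per site
  fitted_ok <- length(lme4::ranef(mm)$id[[1]]) == 20
  print(lme4::fixef(mm))
}
```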

Richard Telford
  • If I include a date column representing each date as an integer, from the first historical date, represented as 1, to the last date, would I still need to include month as a categorical variable? And why would the "type" being derived directly from the response be a problem? – Dragos Geornoiu Apr 10 '16 at 15:19
  • You do if you think there might be seasonal effects – Richard Telford Apr 10 '16 at 15:21
  • I will experiment with months and quarters to try to capture a seasonal effect. I have to do some research on why the type variable should not be derived directly from the response, and on mixed effect models. I will accept this answer as it resolved my initial problem of not knowing how to represent the date in a regression model. Thank you! – Dragos Geornoiu Apr 10 '16 at 15:34