1

I am working on a forecasting model for monthly data which I intend to use in SQL server 2016 (in-database).

I created a simple TBATS model for testing:

dataset <- msts(data = dataset[,3],
            start = c(as.numeric(dataset[1,1]),
                      as.numeric(dataset[1,2])), 
            seasonal.periods = c(1,12))

dataset <- tsclean(dataset,
       replace.missing = TRUE,
       lambda = BoxCox.lambda(dataset, 
                              method = "loglik", 
                              lower = -2, 
                              upper = 1))

dataset <- tbats(dataset,
                 use.arma.errors = TRUE,
                 use.parallel = TRUE,
                 num.cores = NULL
                 )

dataset <- forecast(dataset, 
                    level =c (80,95),
                    h = 24)

dataset <- as.data.frame(dataset)

Dataset was imported from .csv file I created with SQL query.

Later, I used same code in SQL server, input being the same query I used for .csv file (meaning data was exactly the same aswell)

However, when I executed the script, I noticed I got different results. All numbers look fine and make perfect sense, both SQL and standalone R give a forecast table, but all numbers between two tables differ for few % (about 3% on average).

Is there an explanation for this? It really bothers me as I need best possible results.

EDIT: This is how my data looks for easier understanding. It's basically 3 column table: year, month, value of transactions (numbers are randomised because data is classified). Alltogether I have data for 9 years.

2008    11  1093747561919.38
2008    12  816860005030.31
2009    1   341394536377.06
2009    2   669993867646.25
2009    3   717585597605.75
2009    4   627553319006.03
2009    5   984146176491.78
2009    6   605488762214.33
2009    7   355366795222.40
2009    8   549252969698.07
2009    9   598237364101.23

This is an example of results. Top two rows are from SQL server, bottom two rows are from RStudio.

t    Point            Lo80            Hi80
1    872379.7412      557105.271      1187654.211
2    1093817.266      778527.1078     1409107.424

1    806050.6884      517606.464      1094494.913
2    1031845.483      743387.015      1320303.95

EDIT 2: I checked each part of code carefully and I figured out that difference in results happens at TBATS model.

SQL server returns: TBATS(0.684, {0,0}, -, {<12,5>})

RStudio returns: TBATS(0.463, {0,0}, -, {<12,5>})

This explains difference in forecast values, but the question remains as these should be the same.

  • you need to paste a few rows of records for us to understand your data – Ajay Ohri Jul 26 '17 at 09:46
  • I've added data sample, sorry for this mistake. – Kristijan Tornič Jul 26 '17 at 10:23
  • What does comma signify in third column – Ajay Ohri Jul 26 '17 at 12:49
  • just decimals, it's full stop in original data but it changed it to comma when I coppied it from csv for some reason. I wil fix it to prevent future confusion. – Kristijan Tornič Jul 26 '17 at 14:02
  • When you test in RStudio @KristijanTornič, which R runtime do you run against; Microsoft R Server, Microsoft R Open, or CRAN R? If you use CRAN R, could you try and point RStudio against the SQL Server R runtime, and see what your results are. – Niels Berglund Jul 27 '17 at 02:09
  • @Niels Berglund Good idea, I was runing on server runtime since begining. I did try to run other runtimes and I got different results each time. Non of the results match exact results from server, but it is a good starting point and may be a solution for someone else. Thank you. I also tried same runtime same program on my PC and on server and I got different results aswell. Seems like servers handle these things a bit differently. I guess I'll just have to accept it. – Kristijan Tornič Jul 27 '17 at 07:09

1 Answers1

0

I'll answer this for anyone having problems in the future:

Seems like there is an difference in execution in R engine depending on you OS and runtime. I tested this by runing standalone R on my PC and on server using RStudio and Microsoft R Open and runing R in database on my PC and on server. I also tested all different runtimes.

If anyone wants to test themseves, R runtime can be changed in Tools - Global Options - General - R version (for RStudio)

All tests returned a slightly different results. This does not mean the results were wrong (in my case at least, as I'm forecasting for real business data and results have wide intervals anyway).

This may not be an actual solution, but I hope I can prevent someone panicking for a week like I did.