1

I have used Microsoft's "Data Science End to End Walkthrough" to set myself up with R Server, and their example works perfectly.

The example (New York taxi data) uses non-categorical variables (ie distance, taxi fare etc.) to predict a categorical variable (1 or 0 for whether or not a tip was paid).

I am trying to predict a similar binary output using categorical variables as an input, using linear regression (the rxLinMod function), and am coming up with an error.

The error says that the number of parameters does not match the number of variables, however it looks to me like the number of variables is actually the number of levels within each factor (variable).

To Replicate

Create a table called example in SQL Server:

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
    [Person] [nvarchar](max) NULL,
    [City] [nvarchar](max) NULL,
    [Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];

Put data in it:

insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);

I also use a SQL function which returns variables in table format, as that's what it looks like is required from the Microsoft example. Create the function formatAsTable:

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @City AS City,
    @Person AS Person
  );

We now have a table with two categorical variables - Person, and City.

Let's start predicting. In R, run the following:

library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, 
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL which gets our data base
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr, 
                                colClasses = c(City = "factor",Bin="logical",Person="factor"
                                ),
                                rowsPerRead=500)    

Now, set up the linear regression model.

isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)

Look at the model object:

isWonObj

Notice it looks like this:

...
Total independent variables: 11 (Including number dropped: 3)
...

Coefficients:
                           Bin
(Intercept)       6.666667e-01
City=London      -1.666667e-01
City=New York     4.450074e-16
City=Liverpool    3.333333e-01
City=Paris        4.720871e-16
City=Sydney      -1.666667e-01
City=Mexico City       Dropped
Person=John      -1.489756e-16
Person=Paul      -3.333333e-01
Person=George          Dropped
Person=Ringo           Dropped

It says there are 11 variables, which is fine, as this is the sum of levels in the factors.

Now, when I try to predict the Bin value based on City and Person, I get an error:

First I format the City and Person I want to predict for as a table. Then, I predict using this as an input.

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

If you check the pred object, it looks as expected:

> head(pred)
    City Person
1 London George

Now when I try to predict, I get an error.

scoredOutput <- RxSqlServerData(
  connectionString = connStr,
  table = "binaryOutput"
)

rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput, 
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)

The error says:

INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of  variables: 3 vs. 11. 

I can see where the 11 comes from, but I have only supplied 2 values to the predict query - so I can't see where the 3 comes from, or why there is a problem.

Any assistance is appreciated!

Hamilton Blake
  • 622
  • 1
  • 6
  • 19

3 Answers3

0

The answer appears to be consistent with how R treats factor variables, however the error message could have made a more clear distinction between factors, levels, variables and parameters.

It appears that the parameters input to generate a prediction cannot simply be a character or factor with no levels. They need to have the same levels as the factors of the same variable used in the model parameterisation.

As such, the following lines:

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

... should be replaced with this:

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"

column_information<-list(
  City=list(type="factor",levels=c("London","New York","Liverpool","Paris","Sydney","Mexico City")),
  Person=list(type="factor",levels=c("John","Paul","George","Ringo")),
  Bin=list(type="logical")
)

pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      ,colInfo=column_information,
                      stringsAsFactors=FALSE)

I have seen other examples with categorical variables which seem to work without this, but maybe the levels were in there anyway.

I hope this saves someone as many hours as I lost on it!

Edit for SLSvenR's Response

I think my comment regarding having the same levels as the training set still holds.

fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData = data.frame(fac = c("one", "three", "one", "one"))
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

levels(predictionData$fac)<-levels(trainingData$fac)
# rxLinMod() and rxPredict() behave different:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE,checkFactorLevels = TRUE)
rxPred
# This result appears correct to me.

I can't comment on whether this is good or bad - however it looks like one way to get around this is to apply the levels of the training data to the test set, which I assume you can do in real time.

Hamilton Blake
  • 622
  • 1
  • 6
  • 19
0

Are you sure specifying colInfo fixes the problem? It looks like there is a general issue in rxPredict rather than rxPredict in combination with SQL Server:

# lm() and predict() don't have a problem with missing factor levels ("two" in this case):
fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData = data.frame(fac = c("one", "three", "one", "one"))
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

# rxLinMod() and rxPredict() behave different:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown:
# "INTERNAL ERROR: In rxPredict, the number of parameters does not match
# the number of  variables: 3 vs. 4."
# checkFactorLevels = FALSE doesn't help here, it actually seems to just
# check the order of factor levels.
levels(predictionData$fac) <- c("two", "three", "one")
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown (twice):
# ERROR:order of factor levels in the data are inconsistent with
# the order of the model coefficients:fac = two versus fac = one. Set
# checkFactorLevels = FALSE to ignore.
rxPred <- rxPredict(rxModel, data = predictionData, checkFactorLevels = FALSE, writeModelVars = TRUE)
rxPred
#   val_Pred    fac
#1  1           two
#2  3           three
#3  1           two
#4  1           two
# This looks suspicious at best. While the prediction values are still
# correct if you look only at the order of the records in trainingData,
# the model variables are messed up.

In my scenario I have a factor with about 10.000 Levels (only known during the creation of the model) and several more factors with about 5 levels each (known before creation of the model). It seems impossible to specify the levels for all of them while calling rxPredict() in the "correct" order.

SLSvenR
  • 101
  • 1
  • 3
0

While only setting the factor levels (... levels(predictionData$fac)<-levels(trainingData$fac ...) avoids the error it also leads to wrong factor indices used by the model, which can be seen if writeModelVars is set to TRUE. Setting colInfo for my factor with almost 10.000 levels in RxSqlServerData resulted in an application hang, although the query was passed to SQL Server correctly. I changed my strategy into loading the data into a data frame without any factors and then apply RxFactors to it:

rxSetComputeContext("local")

sqlPredictQueryDS <- RxSqlServerData(connectionString = sqlConnString, sqlQuery = sqlQuery, stringsAsFactors = FALSE)

predictQueryDS = rxImport(sqlPredictQueryDS)

if ("Artikelnummer" %in% colnames(predictQueryDS)) { predictQueryDS <- rxFactors(predictQueryDS, factorInfo = list(Artikelnummer = list(levels = allItems))) }

In addition to setting the needed factor levels RxFactors also reorders the factor indices. I'm not saying the solution with colInfo is wrong, maybe it just doesn't work for factors with "too many" levels.

SLSvenR
  • 101
  • 1
  • 3