I have used Microsoft's "Data Science End to End Walkthrough" to set myself up with R Server, and their example works perfectly.
The example (New York taxi data) uses non-categorical variables (ie distance, taxi fare etc.) to predict a categorical variable (1 or 0 for whether or not a tip was paid).
I am trying to predict a similar binary output using categorical variables as an input, using linear regression (the rxLinMod function), and am coming up with an error.
The error says that the number of parameters does not match the number of variables, however it looks to me like the number of variables
is actually the number of levels within each factor (variable).
To Replicate
Create a table called example in SQL Server:
USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
[Person] [nvarchar](max) NULL,
[City] [nvarchar](max) NULL,
[Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];
Put data in it:
insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);
I also use a SQL function which returns variables in table format, as that's what it looks like is required from the Microsoft example. Create the function formatAsTable
:
USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
RETURN
(
-- Add the SELECT statement with parameter references here
SELECT
@City AS City,
@Person AS Person
);
We now have a table with two categorical variables - Person
, and City
.
Let's start predicting. In R, run the following:
library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL which gets our data base
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr,
colClasses = c(City = "factor",Bin="logical",Person="factor"
),
rowsPerRead=500)
Now, set up the linear regression model.
isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)
Look at the model object:
isWonObj
Notice it looks like this:
...
Total independent variables: 11 (Including number dropped: 3)
...
Coefficients:
Bin
(Intercept) 6.666667e-01
City=London -1.666667e-01
City=New York 4.450074e-16
City=Liverpool 3.333333e-01
City=Paris 4.720871e-16
City=Sydney -1.666667e-01
City=Mexico City Dropped
Person=John -1.489756e-16
Person=Paul -3.333333e-01
Person=George Dropped
Person=Ringo Dropped
It says there are 11 variables, which is fine, as this is the sum of levels in the factors.
Now, when I try to predict the Bin
value based on City
and Person
, I get an error:
First I format the City
and Person
I want to predict for as a table. Then, I predict using this as an input.
sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
, colClasses = c(City = "factor",Person="factor"))
If you check the pred
object, it looks as expected:
> head(pred)
City Person
1 London George
Now when I try to predict, I get an error.
scoredOutput <- RxSqlServerData(
connectionString = connStr,
table = "binaryOutput"
)
rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput,
predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)
The error says:
INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of variables: 3 vs. 11.
I can see where the 11 comes from, but I have only supplied 2 values to the predict query - so I can't see where the 3 comes from, or why there is a problem.
Any assistance is appreciated!