I found out that you can execute R within U-SQL, so I took an R script from one of our data scientists and built a U-SQL script based on this sample script.
The adapted script:
DECLARE @INPUT_DAT string =
@"/Samples/Data/dat2json/validationData.dat.201805271617";
DECLARE @OUTPUT string = @"/Samples/Output/validationdata.out";
REFERENCE ASSEMBLY [ExtR];
DECLARE @myRScript = @"
datavector <- as.vector(readBin(@INPUT_DAT, "double", size = 4, n = 99000))
Size <- length(datavector)
numberOfPixels <- Size / 84
MaterialBase <- factor(rep(c("Plastic", "Aluminum"), each = (Size / 2)))
ThicknessBase <- factor(rep(c(rep(c(0, 10, 20, 30, 40, 50), times = 7),
rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6)), each = numberOfPixels))
ThicknessIterated <- factor(rep(c(rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
each = 6), rep(c(0, 10, 20, 30, 40, 50), times = 7)), each = numberOfPixels))
Pixel <- rep(1:numberOfPixels, times = 84)
dflabel <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel,
Value = datavector)
";
@RScriptOutput = REDUCE @myRScript USING new
Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
OUTPUT @ScriptOutput
TO @OUTPUT
USING Outputters.Tsv();
The problem is that when I build the code, Visual Studio stops on line 6, after @". IntelliSense also shows a red squiggle (~), indicating that something is wrong. The error it generates is: Expected one of: OPTION ';'
The R script works perfectly in RStudio.
Update 2018-07-19: I have narrowed it down a bit. The problem is the double quotes inside the @myRScript variable, so I changed the code to the following:
DECLARE @INPUT_DAT string =
@"/dat2json/data/validationData.dat.201805271617";
DECLARE @OUTPUT string = @"/dat2json/data/validationdata.out";
DECLARE @vartype string = "double";
DECLARE @var1 string = "Plastic";
DECLARE @var2 string = "Aluminum";
REFERENCE ASSEMBLY [ExtR];
DECLARE @myRScript string = @"
datavector <- as.vector(readBin(@INPUT_DAT, @vartype, size = 4, n = 99000))
Size <- length(datavector)
numberOfPixels <- Size / 84
MaterialBase <- factor(rep(c(@var1, @var2), each = (Size / 2)))
ThicknessBase <- factor(rep(c(rep(c(0, 10, 20, 30, 40, 50), times = 7),
rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6)), each = numberOfPixels))
ThicknessIterated <- factor(rep(c(rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
each = 6), rep(c(0, 10, 20, 30, 40, 50), times = 7)), each = numberOfPixels))
Pixel <- rep(1:numberOfPixels, times = 84)
dflabel <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel,
Value = datavector)
";
@RScriptOutput = REDUCE @myRScript ON MaterialBase USING new
Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
OUTPUT @ScriptOutput
TO @OUTPUT
USING Outputters.Tsv();
But now I get another error: E_CSC_USER_ROWSETVARIABLENOTFOUND: Rowset variable @myRScript was not found. Description: Rowset variables must be assigned to before they can be referenced. Resolution: Assign a rowset to the rowset variable or remove the reference.
It looks like I have to put the result of the R script into a variable and use that one in the REDUCE statement. But how do I do that?
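For reference, here is a minimal sketch of how the two errors above seem to fit together. It is not a verified fix. U-SQL string literals follow C# rules, so inside an @"..." verbatim string a double quote has to be written as "", and U-SQL variables such as @INPUT_DAT or @vartype are not expanded inside the literal. REDUCE in turn expects a rowset as its input; the R script is only handed over through the command: argument. The sketch assumes the conventions of Microsoft's ExtR sample as far as I recall them: the input rowset is visible in R as inputFromUSQL, the result has to be assigned to outputToUSQL, READONLY key columns are not returned from R, and a RowId column is listed in PRODUCE. The one-row @Dummy rowset, the Par column, the PRODUCE column types, dropping factor() so the columns stay plain values, and using DEPLOY RESOURCE so that R can open the .dat file by its bare file name are all assumptions made for illustration.

```usql
REFERENCE ASSEMBLY [ExtR];

// Assumption: a deployed resource is copied into the working directory
// of the R script, so R can open it by its bare file name.
DEPLOY RESOURCE @"/dat2json/data/validationData.dat.201805271617";

DECLARE @OUTPUT string = @"/dat2json/data/validationdata.out";

// Inside @"..." every R double quote is written as "".
// U-SQL @variables are NOT expanded inside the literal, so the values are spelled out.
DECLARE @myRScript string = @"
datavector <- as.vector(readBin(""validationData.dat.201805271617"", ""double"", size = 4, n = 99000))
Size <- length(datavector)
numberOfPixels <- Size / 84
MaterialBase <- rep(c(""Plastic"", ""Aluminum""), each = (Size / 2))
ThicknessBase <- rep(c(rep(c(0, 10, 20, 30, 40, 50), times = 7),
    rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6)), each = numberOfPixels)
ThicknessIterated <- rep(c(rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6),
    rep(c(0, 10, 20, 30, 40, 50), times = 7)), each = numberOfPixels)
Pixel <- rep(1:numberOfPixels, times = 84)
# Assumption: the reducer picks up the data frame named outputToUSQL.
outputToUSQL <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel,
    Value = datavector, stringsAsFactors = FALSE)
";

// REDUCE consumes a rowset, not the script string.
// This one-row rowset is only a hypothetical placeholder to reduce over.
@Dummy =
    SELECT *
    FROM (VALUES (0)) AS T(Par);

// The PRODUCE list is illustrative; it has to match the columns and types
// that the reducer actually returns.
@RScriptOutput =
    REDUCE @Dummy ON Par
    PRODUCE Par, RowId uint, MaterialBase string, ThicknessBase double,
            ThicknessIterated double, Pixel int, Value double
    READONLY Par
    USING new Extension.R.Reducer(command : @myRScript, rReturnType : "dataframe");

// Output the same rowset variable that REDUCE assigned.
OUTPUT @RScriptOutput
TO @OUTPUT
USING Outputters.Tsv();
```

Note that the OUTPUT statement here refers to @RScriptOutput, the variable assigned by REDUCE; both scripts above output @ScriptOutput, which is never assigned.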