I run some R code after querying 100M records and get the following error after the process runs for over 6 hours:
Msg 39004, Level 16, State 19, Line 300
A 'R' script error occurred during execution of 'sp_execute_external_script'
with HRESULT 0x80004005.
HRESULT 0x80004005 appears to be associated in Windows with Connectivity, Permissions or an "Unspecified" error.
I know from logging in my R code that the process never reaches the R script at all. I also know that the entire procedure completes after 4 minutes on a smaller number of records, for example, 1M. This leads me to believe that this is a scaling problem or some issue with the data, rather than a bug in my R code. I have not included the R code or the full query for proprietary reasons.
However, I would expect a disk or memory error to display a 0x80004004 Out of memory error if that were the case.
One clue I noticed in the SQL ERRORLOG is the following:
SQL Server received abort message and abort execution for major error : 18
and minor error : 42
However the time of this log line does not coincide with the interruption of the process, although it does occur after it started. Unfortunately, there is precious little on the web about "major error 18".
A SQL Trace when running from SSMS shows the client logging in and logging out every 6 minutes or so, but I can only assume this is normal keepalive behaviour.
The sanitized sp_execute_external_script call:
EXEC sp_execute_external_script
@language = N'R'
, @script = N'#We never get here
#returns name of output data file'
, @input_data_1 = N'SELECT TOP 100000000 FROM DATA'
, @input_data_1_name = N'x'
, @output_data_1_name = N'output_file_df'
WITH RESULT SETS ((output_file varchar(100) not null))
Server Specs: 8 cores 256 GB RAM SQL Server 2016 CTP 3
Any ideas, suggestions or debugging hints would be greatly appreciated!
UPDATE: Set TRACE_LEVEL=3 in rlauncher.config to turn on a higher level of logging and re-ran the process. The log reveals a cleanup process that ran, removing session files, at the time the entire process failed after 6.5 hours.
[2016-05-30 01:35:34.419][00002070][00001EC4][Info] SQLSatellite_LaunchSatellite(1, A187BC64-C349-410B-861E-BFDC714C8017, 1, 49232, nullptr) completed: 00000000
[2016-05-30 01:35:34.420][00002070][00001EC4][Info] < SQLSatellite_LaunchSatellite, dllmain.cpp, 223
[2016-05-30 08:04:02.443][00002070][00001EC4][Info] > SQLSatellite_LauncherCleanUp, dllmain.cpp, 309
[2016-05-30 08:04:07.443][00002070][00001EC4][Warning] Session A187BC64-C349-410B-861E-BFDC714C8017 cleanup wait failed with 258 and error 0
[2016-05-30 08:04:07.444][00002070][00001EC4][Info] Session(A187BC64-C349-410B-861E-BFDC714C8017) logged 2 output files
[2016-05-30 08:04:07.444][00002070][00001EC4][Warning] TryDeleteSingleFile(C:\PROGRA~1\MICROS~1\MSSQL1~1.MSS\MSSQL\EXTENS~1\MSSQLSERVER06\A187BC64-C349-410B-861E-BFDC714C8017\Rscript1878455a2528) failed with 32
[2016-05-30 08:04:07.445][00002070][00001EC4][Warning] TryDeleteSingleDirectory(C:\PROGRA~1\MICROS~1\MSSQL1~1.MSS\MSSQL\EXTENS~1\MSSQLSERVER06\A187BC64-C349-410B-861E-BFDC714C8017) failed with 32
[2016-05-30 08:04:08.446][00002070][00001EC4][Info] Session A187BC64-C349-410B-861E-BFDC714C8017 removed from MSSQLSERVER06 user
[2016-05-30 08:04:08.447][00002070][00001EC4][Info] SQLSatellite_LauncherCleanUp(A187BC64-C349-410B-861E-BFDC714C8017) completed: 00000000
It appears the only way to allow my long-running process to continue is to: a) Extend the Job Cleanup wait time to allow the job to finish b) Disable the Job Cleanup process
I have thus far been unable to find the value that sets the Job Cleanup wait time in the MSSQLLaunchpad service. While a JOB_CLEANUP_ON_EXIT flag exists in rlauncher.config, setting it to 0 has no effect. The service seems to reset it to 1 when it is restarted.
Again, any suggestions or assistance would be much appreciated!