After working on this and the related issue, I think it's pretty clear there's a bug in RODBC surrounding the handling of binary data types on SQL Server.
If I start with a connection:
# install.packages('RODBC')
library(RODBC)
connstr = sprintf('driver={ODBC Driver 17 for SQL Server};server=%s;database=%s;uid=%s;pwd=%s',
                  server, database, uid, pwd)  # my parameters
dbhandle <- odbcDriverConnect(connstr)
and in that target database, I create a little table:
create table mikeb.repocase (runid int, bindata varbinary(max));
insert into mikeb.repocase (runid, bindata) values (33, 0x0102030405);
I should be able to select the data out. Here, the data is just five bytes, so the chances that the memory-corruption issue is encountered are low, but not zero. With my connection in R:
tbl1 <- sqlQuery(dbhandle, 'select bindata from mikeb.repocase WHERE runID = 33')
the data is returned. But it's incorrect:
> tbl1$bindata[[1]]
[1] 00 00 00 00 00
It's all zeroes, when 01 02 03 04 05 should be returned.
Casting the data to image will get correct results back on Linux (Ubuntu):
> tbl2 <- sqlQuery(dbhandle, 'select CAST(bindata as image) as bindata from mikeb.repocase WHERE runID = 33')
> tbl2$bindata[[1]]
[1] 01 02 03 04 05
However, on the Windows RGui client, the same code fails with an error about allocating an impossible quantity of memory:
> tbl2 <- sqlQuery(dbhandle, 'select cast(bindata as image) as bindata from mikeb.repocase WHERE runID = 33')
Error in odbcQuery(channel, query, rows_at_time) :
'R_Calloc' could not allocate memory (214748364800 of 1 bytes)
On Linux, getting a non-trivial amount of data from a varbinary(max) column (in my case, about 13 megabytes) sometimes causes a segfault, and the rest of the time returns an object with the correct length but uninitialized data. (Is this a security concern?) However, using the IMAGE cast seems to almost always work.
The offending code is in the RODBC cachenbind() function, where the column description is inspected and the data binding is set up:
case SQL_BINARY:
case SQL_VARBINARY:
case SQL_LONGVARBINARY:
{
    /* should really use SQLCHAR (unsigned) */
    SQLLEN datalen = thisHandle->ColData[i].ColSize;
    thisHandle->ColData[i].datalen = datalen;
    thisHandle->ColData[i].pData = R_Calloc(nRows * (datalen + 1), char);
    BIND(SQL_C_BINARY, pData, datalen);
}
The call to R_Calloc is passed an element size of one, and a count computed as nRows * (datalen + 1). datalen came from the SQLDescribeCol() API for this column.
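To see exactly what the driver claims for these columns, here is a quick diagnostic sketch of my own (not anything in RODBC; it assumes you already have an executed ODBC statement handle and a column count from SQLNumResultCols or similar) that calls SQLDescribeCol directly and prints the reported ColumnSize:

/* Diagnostic only: dump what SQLDescribeCol reports for each column of an
   already-executed statement handle.  On Windows, include <windows.h>
   before the ODBC headers. */
#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

static void dump_col_descriptions(SQLHSTMT hstmt, SQLSMALLINT ncols)
{
    for (SQLUSMALLINT col = 1; col <= (SQLUSMALLINT) ncols; col++) {
        SQLCHAR     name[256];
        SQLSMALLINT namelen, sqltype, digits, nullable;
        SQLULEN     colsize;

        if (SQL_SUCCEEDED(SQLDescribeCol(hstmt, col, name,
                                         (SQLSMALLINT) sizeof(name), &namelen,
                                         &sqltype, &colsize, &digits, &nullable)))
            printf("col %u: %s  DataType=%d  ColumnSize=%llu\n",
                   (unsigned) col, (char *) name, (int) sqltype,
                   (unsigned long long) colsize);
    }
}

Whatever ColumnSize comes back here is what ends up in ColSize and, eventually, in the allocation above.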
For the IMAGE data type, we get SQL_LONGVARBINARY as the type and datalen is -1. With that computation and its careless sign management, plus ignoring overflow, the quantity requested of R_Calloc is 214748364800, matching the error message we see on Windows. (Since I'm not set up for C development on Linux just now, I can't comment on what's going on there. Maybe Linux manages to map some memory in response to this R_Calloc call, maybe it ends up getting different numbers from the Linux version of the same driver, or maybe something else happens entirely.)
For the VARBINARY(max) type, we get SQL_VARBINARY as the type and a datalen of 0. The number of bytes requested of R_Calloc, then, is just nRows, which is way smaller than expected -- it looks like it's 100 bytes.
Curiously, nRows is used here rather than the computed NROWS.
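As a sanity check on the arithmetic, here's a standalone toy of my own (not RODBC code). With nRows at 100, the only way to reach the 214748364800 bytes in the Windows error is for (datalen + 1) to come out at 2^31, so whatever the driver actually hands back for the image column, the sign/overflow mangling must land there; the 2147483647 below is just that back-calculated value, not something I've confirmed in a debugger:

#include <stdio.h>

int main(void)
{
    long long nRows = 100;                 /* the batch size inferred above */

    /* varbinary(max): the driver appears to report a column size of 0 */
    long long datalen = 0;
    printf("varbinary(max): %lld bytes requested\n", nRows * (datalen + 1));

    /* image: back-calculated from the Windows error message */
    datalen = 2147483647LL;
    printf("image:          %lld bytes requested\n", nRows * (datalen + 1));

    return 0;
}

This prints 100 and 214748364800, matching both the tiny varbinary(max) buffer and the impossible Windows allocation.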
Regardless, it seems as though we're simply lucky that RODBC can return binary data at all, because it's not correctly handling the binding of the data or the allocation of the binding memory.
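For what it's worth, here is a rough sketch of the kind of guard that block seems to need. This is my guess, not a tested patch against RODBC, and the 65535-byte cap is an arbitrary number I picked purely for illustration:

case SQL_BINARY:
case SQL_VARBINARY:
case SQL_LONGVARBINARY:
{
    /* should really use SQLCHAR (unsigned) */
    SQLLEN datalen = thisHandle->ColData[i].ColSize;
    /* (max) and image columns can report 0, -1 or a huge sentinel here,
       so clamp to a bounded default before sizing the bound buffer */
    if (datalen <= 0 || datalen > 65535)
        datalen = 65535;
    thisHandle->ColData[i].datalen = datalen;
    /* whether this should be NROWS rather than nRows is the separate
       question raised above */
    thisHandle->ColData[i].pData = R_Calloc(nRows * (datalen + 1), char);
    BIND(SQL_C_BINARY, pData, datalen);
}

A real fix would presumably also need to deal with truncation (data longer than the clamped width) rather than silently cutting it off, but even this much would avoid both the zero-sized buffer and the ~200 GB allocation.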
I've written to the package maintainer today, but I haven't received a response yet.