
Currently I have two SQL query proposals which call DATALENGTH on the same VARBINARY column more than once, due to conditions in the WHERE clause.

Do these multiple calls to DATALENGTH in the same query hurt performance, or can the result of the DATALENGTH calculation be optimized/cached by Microsoft SQL Server?

The goal is to build an EndsWith SQL query for VARBINARY(MAX) FILESTREAM columns.

In my case, I cannot add any extra indexes or columns to the table; however, I'd like to understand the possible alternatives.

-- Table with BinData
DECLARE @foo TABLE(BinData VARBINARY(MAX));
INSERT INTO @foo (BinData) VALUES
(0x00001125), (0x00112500), (0x11250000),
(0x00000011), (0x00001100), (0x00110000), (0x11000000),
(0x112500001125);

-- BinKey
DECLARE @BinKey VARBINARY(MAX) = 0x1125;

-- Search for [BinData] ending with @BinKey --

What I have tried so far...

Proposal A - REVERSE BinKey and BinData, and then use CHARINDEX.

SELECT * FROM @foo WHERE
    DATALENGTH(BinData) >= DATALENGTH(@BinKey) AND
    CHARINDEX
    (
        REVERSE(CONVERT(VARCHAR(MAX), @BinKey)),
        REVERSE(CONVERT(VARCHAR(MAX), BinData))
    ) = 1;
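One caveat I noticed with Proposal A: the VARCHAR round-trip makes CHARINDEX compare according to the current collation rather than byte by byte, so a case-insensitive collation could produce false matches. A small hypothetical demo (byte values chosen for illustration):

-- Under a case-insensitive collation (e.g. SQL_Latin1_General_CP1_CI_AS),
-- 0x41 ('A') and 0x61 ('a') compare equal after conversion to VARCHAR,
-- so Proposal A could report a false "ends with" match.
SELECT CHARINDEX(
    CONVERT(VARCHAR(MAX), 0x41),  -- 'A'
    CONVERT(VARCHAR(MAX), 0x61)   -- 'a'
);  -- returns 1 under a CI collation, 0 under a binary collation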

Proposal B - Cut off the trailing part of BinData and compare it to @BinKey.

SELECT * FROM @foo WHERE
    DATALENGTH(BinData) >= DATALENGTH(@BinKey) AND
    SUBSTRING(
        BinData,
        1 + DATALENGTH(BinData) - DATALENGTH(@BinKey),
        DATALENGTH(@BinKey)
    ) = @BinKey;
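To illustrate the arithmetic on one sample row: for BinData = 0x112500001125 (6 bytes) and @BinKey = 0x1125 (2 bytes), the start position is 1 + 6 - 2 = 5, so SUBSTRING extracts bytes 5 through 6:

-- Worked example of Proposal B's position arithmetic (standalone).
DECLARE @d VARBINARY(MAX) = 0x112500001125;  -- 6 bytes
DECLARE @k VARBINARY(MAX) = 0x1125;          -- 2 bytes
SELECT SUBSTRING(@d, 1 + DATALENGTH(@d) - DATALENGTH(@k), DATALENGTH(@k));  -- 0x1125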

Proposal from James L. - Simply use the RIGHT function.

SELECT * FROM @foo WHERE
    RIGHT(BinData, DATALENGTH(@BinKey)) = @BinKey;

The result of each query listed above must be:

0x00001125
0x112500001125

Is there any better algorithm for this? How well would it perform for larger blobs?
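For reference, a rough (and admittedly unscientific) way I could check whether the repeated DATALENGTH calls are collapsed would be to compare timings with SET STATISTICS TIME, or to inspect the Filter predicate in the actual execution plan. A sketch, reusing @foo and @BinKey from above:

-- Sketch only: compare elapsed time for one DATALENGTH call per row
-- versus several calls per row on the same column.
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM @foo
WHERE DATALENGTH(BinData) >= DATALENGTH(@BinKey);

SELECT COUNT(*) FROM @foo
WHERE DATALENGTH(BinData) >= DATALENGTH(@BinKey)
  AND DATALENGTH(BinData) < 1000000
  AND DATALENGTH(BinData) <> 0;

SET STATISTICS TIME OFF;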

sɐunıɔןɐqɐp
  • SQL Server certainly has *knowledge* that `DATALENGTH` is [deterministic](https://learn.microsoft.com/en-us/sql/relational-databases/user-defined-functions/deterministic-and-nondeterministic-functions?view=sql-server-2017). – Damien_The_Unbeliever Apr 19 '18 at 09:14
  • @Damien_The_Unbeliever: Yes, deterministic... but does that mean calling DATALENGTH multiple times with the same column as an argument in the WHERE clause causes no extra computation? – sɐunıɔןɐqɐp Apr 19 '18 at 09:43
  • It means it has the *opportunity* to reduce the number of calls it makes to a minimum. Whether it does or not would be something you'd need to inspect the *specific* execution plan for your query to determine. – Damien_The_Unbeliever Apr 19 '18 at 09:47
  • Internally, `VAR` values are stored with their length, so I don't see how `DATALENGTH` would ever be anything but constant time, making it largely irrelevant how often you take the length of the same column/variable. But as always, measuring is king. Constructing large blobs isn't rocket science. (For `FILESTREAM`, it might be different, but I highly doubt it -- file systems don't record file lengths for nothing!) – Jeroen Mostert Apr 19 '18 at 09:48
  • This question is better suited for https://dba.stackexchange.com/ – Chris Schaller Jan 23 '21 at 03:10

1 Answer


I'm not sure how to analyze how this will perform, but this does what you asked for:

-- Table with BinData
DECLARE @foo TABLE(BinData VARBINARY(MAX));
INSERT INTO @foo (BinData) VALUES
(0x00001125), (0x00112500), (0x11250000),
(0x00000011), (0x00001100), (0x00110000), (0x11000000),
(0x112500001125);

-- BinKey
DECLARE @BinKey VARBINARY(MAX) = 0x1125;

-- Search for [BinData] ending with @BinKey --
select *
from   @foo
where  right(BinData, datalength(@BinKey)) = @BinKey

This returns the following:

BinData
----------------
0x00001125
0x112500001125

I added 10,000 records, each about 1 KB in size; querying all of them took 0.020 seconds. I then appended 10 records of 1 MB each after the 10,000 1 KB records, and the query took 0.037 seconds. When the 1 KB or 1 MB records end with 0x1125, the execution time increases a little, because SSMS has to receive and format the matching rows in the output window: fifty 1 KB rows ending in 0x1125 took 0.057 seconds, and when one of the 1 MB rows ended in 0x1125, the query took 0.110 seconds. It's not very scientific, but it seems to perform adequately.
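A rough, standalone sketch of how such a measurement could be set up (the ~1 KB row size and row count follow the description above; CRYPT_GEN_RANDOM is just one convenient way to generate filler bytes):

-- Standalone sketch: build 10,000 rows of ~1 KB random bytes, make one
-- row end with the key, then time the RIGHT-based query.
DECLARE @bigfoo TABLE(BinData VARBINARY(MAX));
DECLARE @BinKey VARBINARY(MAX) = 0x1125;

DECLARE @i INT = 0;
WHILE @i < 10000
BEGIN
    INSERT INTO @bigfoo (BinData) VALUES (CRYPT_GEN_RANDOM(1024));
    SET @i += 1;
END;

-- One row that actually ends with @BinKey.
INSERT INTO @bigfoo (BinData) VALUES (CRYPT_GEN_RANDOM(1022) + 0x1125);

SET STATISTICS TIME ON;
SELECT * FROM @bigfoo WHERE RIGHT(BinData, DATALENGTH(@BinKey)) = @BinKey;
SET STATISTICS TIME OFF;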

James L.
  • LEN() sometimes doesn't report the correct length of binary data, especially when the data ends with 0x20 (a trailing blank), although using RIGHT is a very good point (I've added your proposal to my question). DATALENGTH solves this issue. -- Try this: SELECT LEN(0x2020252020), DATALENGTH(0x2020252020); – sɐunıɔןɐqɐp Apr 19 '18 at 09:51
  • I have a lot of FILESTREAM data. This approach seems to work nicely for FILESTREAM fields too. – James L. Apr 19 '18 at 09:54
  • I see what you mean about `len()`. That seems like a bug MS should fix. – James L. Apr 19 '18 at 09:58
  • As Microsoft states: LEN excludes trailing blanks. If that is a problem, consider using the DATALENGTH (Transact-SQL) function which does not trim the string. - https://learn.microsoft.com/en-us/sql/t-sql/functions/len-transact-sql?view=sql-server-2017 – sɐunıɔןɐqɐp Apr 19 '18 at 10:05
  • @JamesL. They did, the fix is called `DATALENGTH`. The fact that trailing spaces "don't count" for comparison is enshrined by now, and clearly documented -- note that `'a ' = 'a'`, so `LEN('a ') = LEN('a')`. Even though `LEN` nominally operates on binary values, it's a string function at heart. – Jeroen Mostert Apr 19 '18 at 10:07