5

Here's a problem I've repeatedly encountered while playing with the Stack Exchange Data Explorer, which is based on T-SQL:

How to search for a string except when it occurs as a substring of some other string?

For example, how can I select all records in a table MyTable where the column MyCol contains the string foo, but ignoring any foos that are part of the string foobar?

A quick and dirty attempt would be something like:

SELECT * 
FROM MyTable 
WHERE MyCol LIKE '%foo%' 
  AND MyCol NOT LIKE '%foobar%'

but obviously this will fail to match e.g. MyCol = 'not all foos are foobars', which I do want to match.

One solution I've come up with is to replace all occurrences of foobar with some dummy marker (that is not a substring of foo) and then checking for any remaining foos, as in:

SELECT * 
FROM MyTable 
WHERE REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

This works, but I suspect it's not very efficient, since it has to run the REPLACE() on every record in the table. (For SEDE, this would typically be the Posts table, which currently has about 30 million rows.) Are the any better ways to do this?

(FWIW, the real use case that prompted this question was searching for SO posts with image URLs that use the http:// scheme prefix but do not point to the host i.stack.imgur.com.)

Ilmari Karonen
  • 49,047
  • 9
  • 93
  • 153
  • Are you playing with the hosted version, or are you downloading the data to play with on your local system? If you are downloading locally, or have the ability to if you didn't know that was an option, then you can add RegEx functionality via SQLCLR. For example, you could download the [SQL#](http://SQLsharp.com/) library (which I wrote, but the RegEx stuff is in the Free version), install it into a `Utility` DB, and then use it in queries for this or other stuff :-). – Solomon Rutzky Feb 01 '16 at 22:00
  • @srutzky: I'm using the hosted DB. I suppose I could look into downloading the data, but a solution that works online would be preferable. – Ilmari Karonen Feb 01 '16 at 22:10

4 Answers4

5

Neither of the ways given so far are guaranteed to work as advertised and only perform the REPLACE on a subset of rows.

SQL Server does not guarantee short circuiting of predicates and can move compute scalars up into the underlying query for derived tables and CTEs.

The only thing that is (mostly) guaranteed to work is the CASE statement. Below I use the syntactic sugar variety of IIF that expands out to CASE

SELECT *
FROM   MyTable
WHERE  1 = IIF(MyCol LIKE '%foo%', 
               IIF(REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%', 1, 0), 
               0);
Community
  • 1
  • 1
Martin Smith
  • 438,706
  • 87
  • 741
  • 845
1

A three-stage filter should work:

  1. collect all rows matching '%foo%';

  2. replace all instances of 'foobar' with a non-occurring string (such as '' perhaps);

  3. Check again for matching '%foo%'

Here you only perform the REPLACE on potentially matching rows, not all rows. If you are expecting only a small percentage of matches, this should be much more efficient.

SQL would look like this:

;with data as (
    select * 
    from MyTable 
    where MyCol like '%foo%'      
)
select *
from data
where replace(MyCol, 'foobar', 'X') like '%foo%'

Note that a sub-query is required, as there are no expression short-cuts in SQL; the engine is free to reorder Boolean terms as desired for efficient processing within a singe query level.

Pieter Geerkens
  • 11,775
  • 2
  • 32
  • 52
1

This will be faster than your current query:

SELECT * 
FROM MyTable 
WHERE 
  MyCol like '%foo%' AND
  REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

The REPLACE is calculated after MyCol has been applied, so this is faster than just:

REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'
t-clausen.dk
  • 43,517
  • 12
  • 59
  • 92
0

Assuming you're only interested in finding instances of foo with spaces surrounding them

 SELECT * 
 FROM MyTable 
 WHERE MyCol LIKE 'foo %' OR MyCol LIKE '% foo %' OR MyCol LIKE '% foo'
Paul Hunt
  • 3,395
  • 2
  • 24
  • 39
  • Alas, this will not match e.g. `'a fooing fooer foos the foos'`, which I do want to match. Let me update my example in the question. – Ilmari Karonen Feb 01 '16 at 12:00
  • So you want to match fooing, fooer and foos but not foobar? – Paul Hunt Feb 01 '16 at 12:04
  • Yes. The real use case that prompted this question was actually finding posts with image URLs that use the `http://` scheme prefix but do *not* point to the host `i.stack.imgur.com`. – Ilmari Karonen Feb 01 '16 at 12:08