1

I have a table with 5-10 million records which has 2 fields

example data

Row  Field1   Field2
------------------
1    0712334  072342344
2    06344534 083453454
3    06344534 0845645565

Given 2 variables

variable1 : 0634453445645
variable2 : 08345345456756

I need to be able to query the table for best matches as fast as possible

The above example would produce 1 record (e.g row 2)

What would be the fastest way to query the database for matches?

Note : the data and variables are always in this format (i.e always a number, may or may not have a leading zero, and fields are not unique however the combination of both will be )

My initial thought was to do something like this

Select blah where Field1 + "%" like variable1 and  Field2 + "%" like variable2

Please forgive my pseudo-code if it's not correct, as this is more a fact-finding mission. However I think I'm in the ball park.

Note : I don't think any indexing can help here, though a memory-based table I'm guessing would speed this up.

Can anyone think of a better way of solving the problem?

halfer
  • 19,824
  • 17
  • 99
  • 186
TheGeneral
  • 79,002
  • 9
  • 103
  • 141
  • Although all the answers helped me out, i think the @Mikael Eriksson answer got me closest at this stage, by thinking about plans and indexing via the SQL tuning advisor – TheGeneral Mar 05 '14 at 21:46

4 Answers4

1

You can get a plan with a seek on an index on Field1 with query like this.

declare @V1 varchar(20) = '0634453445645'
declare @V2 varchar(20) = '08345345456756'

select Field1,
       Field2
from YourTable
where Field1 like left(@V1, 4) + '%' and
      @V1 like Field1 + '%' and
      @V2 like Field2 + '%'

It does a range seek on the first four characters on Field1 and uses the full comparison on Field1 and Field2 in a residual predicate.

enter image description here

Mikael Eriksson
  • 136,425
  • 22
  • 210
  • 281
  • Wow this is the sort of smarts i was looking for, how do you think this would compare (performance) to alzaimars answer? – TheGeneral Mar 04 '14 at 07:47
  • @Saruman You have to test on your data. I do think my query is easier to build since you don't have to generate the in clauses using the passed in variable. – Mikael Eriksson Mar 04 '14 at 07:52
  • I like your approach very much. Thanks to this new idea in solving puzzles. – alzaimar Mar 04 '14 at 12:40
0

There is no performance tip. SImple like that.

%somethin% is table scan, Indices are not used due to the beginning %. Ful ltext indexing won't work as it is not a full text you seek but part of a word.

Getting a faster machine to handle the table scans and denormalizing is the only thing you can do. 5-10 million rows should be faste enough on a decent computer. Memory based table is not needed - just enough RAM to cache that table.

And that pretty much is it. Either find a way to get rid of the initial % or get hardware (mostly memory) fast enough to handle this.

OR - handle it OUTSIDE sql server. Load the 5-10 million rows into a search service and use a better data structure. SQL being generic has to make compromises. But again, the partial match will kill pretty much most approaches.

TomTom
  • 61,059
  • 10
  • 88
  • 148
  • Thanks or your reply, luckly i only need something% which i believe is faster, however i was thinking due to the nature of the data, i may be able to use reverse integers (dont ask me how) or maybe another approach. though i kind of expected this kind of answer, thanks again – TheGeneral Mar 04 '14 at 07:21
  • He is not asking for a '%foo%' search, but for kind of a 'reversed begins with'. So I guess there *is* an optimization possibility. – alzaimar Mar 04 '14 at 07:27
0

Postgres has trigram indexes http://www.postgresql.org/docs/current/interactive/pgtrgm.html

Maybe SQL Server has something like that?

Emery Lapinski
  • 1,572
  • 18
  • 26
  • I once tried 3-grams on an SQL-Server on a table with 6 Mio rows containing music titles and singer/band names. The resulting Trigram table contained approx 100 mio entries and searching was a pain. I switched to fulltext. However, this is not appropriate in this case. – alzaimar Mar 04 '14 at 07:29
  • Interesting, however i'm not sure if trigram would be the best fit for the problem, even if it is implemented in SQl Server, thanks for your input though – TheGeneral Mar 04 '14 at 07:30
0

What is the shortest length in Column 'Field1' and 'Field2'? Call this number 'N'.

Then create a select statement which asks for all substrings starting at the first character of length N to the length of each variable. Example (say, N=10)

select distinct * from myTable 
where Field1 in ('0634453445','06344534456','063445344564', '0634453445645')
  and Field2 in ('0834534545','08345345456','083453454567', '0834534545675','08345345456756')

Write a small script which creates the query for you. Of course there is much more to optimize but this requires (imho) changes in the structure of your table and I can imagine that this is something you don't want. At least you can give it a fast try.

Also, you should include the query plan when you try this approach in SSMS. The query plan will give you a nice hint in how to organize your index.

alzaimar
  • 4,572
  • 1
  • 16
  • 30
  • Hrm, the shortest combination is 4 characters the longest is about 9, i could definitely do some benchmarks to see if this gives some performance gains, ill be using linq so this is quite easy to achieve i'm guessing – TheGeneral Mar 04 '14 at 07:37