
We have a SQL Server database table with about 60 million records. These are records of names and addresses of specific entities. The table contains the following columns:

[Name] [nvarchar](425) NOT NULL,
[Street] [nvarchar](900) NULL,
[City] [nvarchar](900) NULL,
[State] [nvarchar](900) NULL,
[PostalCode] [nvarchar](100) NULL

What we have to achieve is to be able to perform a specific select statement in under 1 second.

We should be able to select records based on whether [Name] contains one or several entered words (not an 'exact match' and not 'starts with'), and then apply the following prioritization logic (a sketch of the desired ordering follows this list):

  1. Display on top the records located in the given [State] and [City]
  2. Display the records that are in the given [State] but in another city
  3. Display the records located in other states
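
For illustration, here is a minimal sketch of that requirement as a single query. The table name dbo.Entities and the parameters @Word1, @Word2, @State, @City are placeholders, not part of the original schema, and the leading-wildcard LIKE predicates only express the logic; they cannot be satisfied by an index seek, which is exactly why the 1-second target is hard:

SELECT [Name], [Street], [City], [State], [PostalCode]
FROM dbo.Entities
WHERE [Name] LIKE N'%' + @Word1 + N'%'
  AND [Name] LIKE N'%' + @Word2 + N'%'                -- one predicate per entered word
ORDER BY
    CASE
        WHEN [State] = @State AND [City] = @City THEN 1   -- given state and city
        WHEN [State] = @State                    THEN 2   -- given state, another city
        ELSE 3                                             -- all other states
    END;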

Here is what we have tried:

  1. We tried rebuilding the table in a variety of ways: extracting different columns into different tables, trying different sets of indexes, and extracting every word of [Name] as a token into a separate table
  2. SQL Server full-text search (matching records with the CONTAINS function; see the sketch after this list)
  3. Azure Cosmos DB. We migrated the data there to evaluate whether we could perform selects efficiently enough
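
For item 2, the full-text predicate presumably looked something like the following; the prefix terms and the table name dbo.Entities are illustrative, and this assumes a full-text index already exists on [Name]:

SELECT [Name], [City], [State]
FROM dbo.Entities
WHERE CONTAINS([Name], N'"first*" AND "national*"');   -- match rows whose Name contains both words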

The problem is always prioritizing records based on state + city.

The question is how we can achieve selects under 1 second for this case using SQL Server or any other data source (preferably one available on Azure).

Stanislav
  • Can you make these columns narrower? What city and/or state on Earth has 900 characters? Or a 100 character postal code? – Jacob H Mar 21 '19 at 16:58
  • Did you try columnstore indexes in your experimentation? How many results do you typically get for the `Name` predicate that need ordering by the other conditions? – Martin Smith Mar 21 '19 at 17:20
  • @MartinSmith Yes, I did. Actually, the number of results is one of the problems; it could be 200k+ – Stanislav Mar 21 '19 at 17:24
  • I don't know if it's possible for you to change your design a little bit, but if it is, you should normalize it. It's not appropriate to keep the city name as text in the address table. You can have a city table and reference it with a foreign key from the address table; then you'll be able to filter the small table of cities and join it with the address table. – saman tr Mar 21 '19 at 17:38
  • @samantarighpeima That makes sense, but do you think it would help resolve the performance problem? – Stanislav Mar 21 '19 at 18:30
  • Does the set of "one or several entered words" that you mentioned change? Is the 60 million record count set, or will that table grow? – Brian Mar 21 '19 at 18:32
  • @stanislav Of course... a where clause on a small table gives you quick results, and then your join of the two tables is on two integer, uniquely indexed columns. Normalization is the first step; then you can consider different tuning tricks to increase the performance. – saman tr Mar 21 '19 at 18:39
  • @Brian To give you more context: all those records represent businesses/organizations of any kind (banks, schools, hospitals, etc.), and we basically want to give users the ability to search them. So, yes, the user could basically enter any words there – Stanislav Mar 21 '19 at 18:39
  • That doesn't answer my questions, @Stanislav. – Brian Mar 21 '19 at 18:42
  • The same goes for states; actually, state should be in a relation with city as its parent table. – saman tr Mar 21 '19 at 18:43
  • @Brian Sorry. Yes, those could be any words. It is unlikely that the size of the table will change – Stanislav Mar 21 '19 at 18:44
  • @samantarighpeima I think I forgot to mention that 'City' and 'State' both have non-clustered indexes. I don't believe joining with a separate table will make much difference compared to selecting with a non-clustered index, but I will try to test this hypothesis – Stanislav Mar 21 '19 at 18:50

1 Answer


The only thing I can think of, beyond normalizing out City, State, and Zip and sizing those fields appropriately, is to make a word list table:

Create Table tbl_Entity
(
    [ID] [Int] Identity Not Null,
    [Name] [nvarchar](425) NOT NULL,
    [Street] [nvarchar](900) NULL,
    [City] [nvarchar](900) NULL,
    [State] [nvarchar](900) NULL,
    [PostalCode] [nvarchar](100) NULL
)

Create Table tbl_Entity_Name_Elements
(
    [ID] [Int] Identity Not Null,
    [Entity_ID] [Int] Not Null,   -- foreign key to tbl_Entity
    [Name_Element] [nvarchar](100) Null   -- one word parsed out of tbl_Entity.[Name]
)

Have a routine (a nightly job, maybe) that populates tbl_Entity_Name_Elements by parsing the rows in tbl_Entity. Index tbl_Entity_Name_Elements on Name_Element, and you should be able to get the Entity_ID values that contain all of a given list of words fairly quickly; that lookup ought to be SARG-able (a rough sketch follows). That gives you the tbl_Entity items that you need. Does that make sense?
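
A rough sketch of that lookup, assuming the schema above; the @Words table variable holding the entered words is illustrative, and the words must match tokens exactly so the index on Name_Element can be used for seeks:

Declare @Words Table (Word nvarchar(100) Not Null);
Insert Into @Words (Word) Values (N'first'), (N'national'), (N'bank');

-- Entity_IDs whose name contains every entered word
Select ne.Entity_ID
From tbl_Entity_Name_Elements ne
Join @Words w On w.Word = ne.Name_Element
Group By ne.Entity_ID
Having Count(Distinct ne.Name_Element) = (Select Count(*) From @Words);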

Brian
  • That is a good idea. In one of the experiments I tried this approach; it did bring value, but it is still not fast enough. Especially when there are a lot of matches, the most challenging part is to prioritize results fast enough (to display on top the records from the given state and city, then from the given state but other cities, and then all the others) – Stanislav Mar 21 '19 at 19:42
  • Have you looked at a scoring function, where you get a point for matching state, another point for matching city, and then throwing `Order by Proximity_Score Desc` into your query? (A sketch of that idea follows these comments.) Also, I have to ask: What, exactly, is driving the sub-second query execution requirement? How is this data being consumed? Is SQL Server query performance *really* the bottleneck on the user experience? – Brian Mar 21 '19 at 19:54
  • Your comment made me revisit this idea, and after some further schema refinements I made some progress here, so I accept your answer – Stanislav Mar 22 '19 at 20:08
  • I'm very glad to hear that, @Stanislav. I would love to know how fast you get your query to run, if you're willing to share. – Brian Mar 25 '19 at 14:12
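
For reference, a hedged sketch of how the scoring idea from the comments above might combine with the word-list lookup from the answer; the parameter values and word list are illustrative, and this is not necessarily the exact refinement the asker settled on:

Declare @State nvarchar(900) = N'NY', @City nvarchar(900) = N'New York';

Select e.[Name], e.[City], e.[State],
       Case When e.[State] = @State Then 1 Else 0 End        -- a point for matching state
     + Case When e.[City]  = @City  Then 1 Else 0 End        -- a point for matching city
       As Proximity_Score
From tbl_Entity e
Where e.ID In (Select ne.Entity_ID
               From tbl_Entity_Name_Elements ne
               Where ne.Name_Element In (N'first', N'national', N'bank')
               Group By ne.Entity_ID
               Having Count(Distinct ne.Name_Element) = 3)   -- the word-list lookup from the answer
Order By Proximity_Score Desc, e.[Name];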