Performance tuning on matching word using CHARINDEX and COLLATION

Question

I have two tables with following records:

Table 1: 100 rows

Table 2: 50 Millions rows

Example:

Table 1: tb100

create table tb100
(
    name varchar(50)
);

insert into tb100 values('Mak John'),('Will Smith'),('Luke W')......100 rows.

Table 2: tb50mil

create table tb50mil
(
    name varchar(50)
);

insert into tb10mil values('John A Mak'),('K Smith Will'),('James Henry')......50 millions rows.    

create nonclustered index nci_tb10mil_name  on tb10mil(name);

Requirement: I want to match the name between two tables, if any WORD(John,Smith,Will) present in another table. For example John present in John A Mark.

My try: Used XML to split column name of the table tb100 and also adding collation with CHARINDEX.

;WITH splitdata AS 
(
    SELECT splitname 
    FROM   
    (
        SELECT *,Cast('<X>' + Replace(t.name, ' ', '</X><X>') + '</X>' AS XML) AS  xmlfilter 
        FROM   tb100 t
    )F1 
    CROSS apply 
    (
        SELECT fdata.d.value('.', 'varchar(50)') AS splitName 
        FROM   f1.xmlfilter.nodes('X') AS fdata(d)
    ) O
) 
SELECT t2.name AS [Aadhar Names]
FROM   tb50mil t2 
INNER JOIN splitdata S 
ON CHARINDEX(S.splitname collate Latin1_General_BIN,T2.name collate Latin1_General_BIN)>0
GROUP BY t2.name

Time taken to execution: 00:01:34
Rows affected: (2251429 row(s) affected)

Execution Plan:

Perhaps you can create a schema-bound view that contains all substrings of a space delimited string (you need a deterministic string split function). Rest is straight forward. — Salman A, Jan 30 '19 at 08:31

Xaphas · Answer 1 · 2019-01-30T07:11:38.207

If you need the separate words within your name then maybe having a table where your name is just one string isn't optimal conceptionally. Also, separating the names now is painful since you have no reoccurring pattern to your middle names. Plus string modification is really not a SQL strength. I would instead extend your table into something like this:

alter table tb100 
add 
nameID int IDENTITY(1,1) NOT NULL,
first_name varchar(50) null,
middle_name varchar(50) null,
last_name varchar(50) null

insert into tb100 values('Mak John'),('Will Smith'),('Luke W')......100 rows.

if (SELECT LEN(col) - LEN(REPLACE(col, ' ', '')) > 1)
update tb100
set
first_name = (Select Substring(name, 0, (Charindex(' ', name)))),
middle_name = (Select Substring( Right(name, (LEN(name) - (Charindex(' ', name) +1)), 0, (Charindex(' ', name)))),
last_name = (select Substring( Right(middle_name, (LEN(middle_name) - (Charindex(' ', middle_name) + 1)), LEN(LEN(middle_name) - (Charindex(' ', middle_name) + 1))

else 
update tb100
set
first_name = (Select Substring(name, 0, (Charindex(' ', name)))),
middle_name = '',
last_name = (Select Substring( Right(name, (LEN(name) - (Charindex(' ', name) +1)), 0, (Charindex(' ', name))))

I hope it works I didn't have a chance to test it because I'm on the road. If you have the chance of inserting the data into those columns without this entire modification then please do so. You then do the same to your other table...

alter table tb50mil 
add 
nameID int IDENTITY(1,1) NOT NULL,
first_name varchar(50) null,
middle_name varchar(50) null,
last_name varchar(50) null

insert into tb10mil values('John A Mak'),('K Smith Will'),('James Henry')......50 million rows.

if (SELECT LEN(col) - LEN(REPLACE(col, ' ', '')) > 1)
update tb50mil
set
first_name = (Select Substring(name, 0, (Charindex(' ', name)))),
middle_name = (Select Substring( Right(name, (LEN(name) - (Charindex(' ', name) +1)), 0, (Charindex(' ', name)))),
last_name = (select Substring( Right(middle_name, (LEN(middle_name) - (Charindex(' ', middle_name) + 1)), LEN(LEN(middle_name) - (Charindex(' ', middle_name) + 1))

else 
update tb50mil
set
first_name = (Select Substring(name, 0, (Charindex(' ', name)))),
middle_name = '',
last_name = (Select Substring( Right(name, (LEN(name) - (Charindex(' ', name) +1)), 0, (Charindex(' ', name))))

and from here on it's a simple join really:

select * from tb100 hun
inner join
tb50mil mil on hun.first_name = mil.first_name OR hun.middle_name = mil.middle_name OR hun.last_name OR mil.last_name

Hope this helps!

Sorry I can't normalize here, because the same logic gonna use for other column like `Address`. — MAK, Jan 30 '19 at 07:12
I don't think you can get more performance out of your comparison if you can't normalize. You're always going to end up modifying strings, which naturally, costs a lot of resources. Maybe consider normalizing your other columns as well, it could really be worth your while. After all, this shouldn't even bother the rest of the database, since your `name` column stays where it is, the way it is. — Xaphas, Jan 30 '19 at 07:36

Performance tuning on matching word using CHARINDEX and COLLATION

1 Answers1