Extracting person names from a free-form text field

Question

I have a large table of 30 million records that contains a free-form text field which may contain names in any position and with any salutation, or not salutation at all.

My job is to mask out the names with Xxxxx Xxxxx to preserve privacy.

I have access to a large surnames database that defines for me what constitutes a name.

Using SQL Server 2012, what is the most efficient technique I can use for this task?

EDIT

Okay, I've got something working pretty decently that involves a Full-Text index/search, the names database, and a stored procedure.

However, I've run into a rather peculiar problem. I'm using a CONTAINS predicate (CONTAINS([textvaluefield], @namestring) where SET @namestring = 'NEAR((Dr.,'+@name+'), 1, TRUE)'.

This works perfectly except when the salutation in the [textvaluefield] is "DR." instead of "Dr.", i.e "DR. Johnson" is not getting picked up, yet "Dr. Johnson" is. I've verified this because if I change the value in the [textvaluefield] of a record from "DR." to "Dr.", yet leave everything else the same, that record will suddenly get picked up. If I revert the record to use "DR.", it will not get picked up again.

What make this bizarre is that I'm definitely using a case insensitive collation (Latin1_General_CI_AS). Anyone have any ideas?

So you have a free text field and you have a table of names and you want to remove all instances in the table of names out of that free text field? If a user named smith wrote iron smith in his post, how exactly would you know to remove his name 'smith' and not the term 'iron smith' (which would reduce iron smith to iron, assuming iron isn't a surname on your list, in which case iron smitrh would be entirely removed)? This seems like any result you get will be littered with exceptions. Here's hoping 'the' is not a surname — Twelfth, Aug 05 '14 at 23:49
Agreed, but we'll mitigate as follows. We won't actually delete the names. We'll mask them with 'Xxxxx'. This will of course turn iron smith into iron Xxxxx - not much of an improvement. But then we'll maintain both the original text field and the changed text field, and as we detect problems like this, we'll fix our historical problems and adjust our go-forward process. Not ideal, I know, but the privacy factor takes precedence, and as this is a field for medical notes, we think the collision rate will be low. — user3553088, Aug 06 '14 at 01:38
"What make this bizarre is that I'm definitely using a case insensitive collation (Latin1_General_CI_AS). Anyone have any ideas? " -- Make everything lower case with lower()...or upper if you prefer. -- Good idea with retaining everything in any event, you do not want data loss for this. — Twelfth, Aug 06 '14 at 16:18
Yes, I can attempt to force a case conversion, but here's why I posed the question: full-index search is supposed to convert everything to lower-case anyways, and it does this with numerous values, but not with DR vs Dr. That's the mystery I'm trying to solve! — user3553088, Aug 07 '14 at 21:26

score 0 · Answer 1 · answered Aug 29 '14 at 13:25

If you can verify that you don't have any records in your "stopwords" tables:

SELECT * FROM sys.[fulltext_system_stopwords] AS FSS WHERE [stopword] LIKE 'Dr_'
SELECT * FROM sys.[fulltext_stopwords] AS FS

I have also encountered a similar issue and resolved it by creating a schema bound view on the tables and columns you need and explicitly create a column using the LOWER function.

CREATE VIEW [User].[UserValues]
WITH
 SCHEMABINDING
AS
SELECT
        [UserId]
      , [UserName]
      , LOWER(Username]) AS [LoweredUsername]
    FROM
        [User].[Values]

Dont forget to add a unique clustered index for the full text to use.

Extracting person names from a free-form text field

1 Answers1