I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.
My question: how do I modify the regular expression to include each term within the OR separated list?
REGULAR EXPRESSION
' and | and or | a o | company | co | c o | dba | d b a '
DESIRED BEHAVIOR
Replace each found term (including the prefix and suffix spaces) with a single space.
ACTUAL BEHAVIOR
Every "even" (as opposed to "odd") term found is replaced (including the prefix and suffix spaces) with a single space.
EXAMPLE
Source String
' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '
Result String (Desired Behavior)
' MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT '
Result String (Actual Behavior)
' MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT '
ENVIRONMENT
SQL Server 2005
User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)
CODE
set nocount on
declare @source [varchar](800)
declare @regexp [varchar](400)
declare @replace [char](1)
declare @globalReplace [bit]
declare @ignoreCase [bit]
declare @result [varchar](800)
set @globalReplace = 1
set @ignoreCase = 1
SET @source = ' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT '
set @regexp = ' and | and or | a o | company | co | c o | dba | d b a '
set @replace = ' '
select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)
print @result
... producing the result:
MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT
* dbo.regexReplace user-defined function definition *
CREATE FUNCTION [dbo].[regexReplace]
(
@source varchar(5000),
@regexp varchar(1000),
@replace varchar(1000),
@globalReplace bit = 0,
@ignoreCase bit = 0
)
RETURNS varchar(1000) AS
BEGIN
DECLARE @hr integer
DECLARE @objRegExp integer
DECLARE @result varchar(5000)
EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
IF @hr <> 0
BEGIN
EXEC @hr = sp_OADestroy @objRegExp
RETURN NULL
END
EXECUTE @hr = sp_OADestroy @objRegExp
IF @hr <> 0
BEGIN
RETURN NULL
END
RETURN @result
END