REGEXP_SUBSTR to return first and last segment

Question

I have a dataset which may store an account number in several different variations. It may contain hyphens or spaces as segment separators, or it may be fully concatenated. My desired output is the first three and last 5 alphanumeric characters. I'm having problems with joining the two segments "FIRST_THREE_AND_LAST_FIVE:

with testdata as (select '1-23-456-78-90-ABCDE' txt from dual union all
                  select '1 23 456 78 90 ABCDE' txt from dual union all
                  select '1234567890ABCDE' txt from dual union all
                  select '123ABCDE' txt from dual union all
                  select '12DE' txt from dual)
select TXT
       ,regexp_replace(txt, '[^[[:alnum:]]]*',null) NO_HYPHENS_OR_SPACES
       ,regexp_substr(regexp_replace(txt, '[^[[:alnum:]]]*',null), '([[:alnum:]]){3}',1,1) FIRST_THREE
       ,regexp_substr(txt, '([[:alnum:]]){5}$',1,1) LAST_FIVE
       ,regexp_substr(regexp_replace(txt, '[^[[:alnum:]]]*',null), '([[:alnum:]]){3}',1,1) FIRST_THREE_AND_LAST_FIVE
from  testdata;

My desired output would be:

FIRST_THREE_AND_LAST_FIVE
-------------------------
123ABCDE
123ABCDE
123ABCDE
123ABCDE
(null)

Ideally, I want to find something like ([[:alnum:]]){3} and ([[:alnum:]]){5}$ which will allow any number of characters in the middle. — Paul Young, Sep 28 '15 at 17:15

score 1 · Accepted Answer · answered Sep 28 '15 at 22:27

Here's my try. Note that when regexp_replace() does not find a match, the original string is returned, that's why you can't get a null directly. My thought was to see if the result string matched the original string but of course that would not work for line 4 where the result is correct and happens to match the original string. Others have mentioned methods for counting length, etc with a CASE but I would get more strict and check for the first 3 being numeric and the last 5 being alpha as well since just checking for 8 characters being returned doesn't guarantee they are the right 8 characters! I'll leave that up to the reader.

Anyway this looks for a digit followed by an optional dash or space (per the specs) and remembers the digit (3 times) then also remembers the last 5 alpha characters. It then returns the remembered groups in that order.

I highly recommend you make this a function where you pass your string in and get a cleaned string in return as it will be much easier to maintain, encapsulate this code for re-usability and allow for better error checking using PL/SQL code.

SQL> with testdata(txt) as (
  2    select '1-23-456-78-90-ABCDE' from dual
  3    union
  4    select '1 23 456 78 90 ABCDE' from dual
  5    union
  6    select '1234567890ABCDE'      from dual
  7    union
  8    select '123ABCDE'             from dual
  9    union
 10    select '12DE'                 from dual
 11  )
 12  select
 13    case when length(regexp_replace(upper(txt), '^(\d)[- ]?(\d)[- ]?(\d)[- ]?.*([A-Z]{5})$', '\1\2\3\4')) < 8
 14         -- Needs more robust error checking here
 15         THEN 'NULL'  -- for readability
 16      else regexp_replace(upper(txt), '^(\d)[- ]?(\d)[- ]?(\d)[- ]?.*([A-Z]{5})$', '\1\2\3\4')
 17    end result
 18  from testdata;

RESULT
--------------------------------------------------------------------------------
123ABCDE
123ABCDE
123ABCDE
123ABCDE
NULL

SQL>

I'll have to play with this some as it was my fault for not specifying that any of the characters can be alnum; but this looks like what I was after. I'll try to eliminate the case by wrapping your version of replace with a substr {8} to incorporate the null. But I fully agree with turning my final solution into a function for wider use and maintenance. — Paul Young, Sep 29 '15 at 02:47
Thank you, I was able to adapt this because of your `regex_replace`. I was also able to remove the `case` requirement by wrapping that function with a `regexp_substr`. I find this combined method much more elegant because it is only calling a single data field a single time. Final code: `,regexp_substr(regexp_replace(txt, '^([[:alnum:]])[- ]?([[:alnum:]])[- ]?([[:alnum:]])[- ]?.*([[:alnum:]]{5})$', '\1\2\3\4'), '([[:alnum:]]){8}',1,1) FIRST_THREE_AND_LAST_FIVE` — Paul Young, Sep 29 '15 at 19:52

Adam Martin · Answer 2 · 2015-09-28T19:55:39.893

0

I feel like I'm missing something, but can't you just concatenate your two working columns? I.e., since you have successful regex for first 3 and last 5, just replace FIRST_THREE_AND_LAST_FIVE with:

regexp_substr(regexp_substr(regexp_replace(txt, '[^[[:alnum:]]]*',null), '([[:alnum:]]){3}',1,1)||regexp_substr(txt, '([[:alnum:]]){5}$',1,1),'([[:alnum:]]){5}',1,1)

EDIT: Added regexp_substr wrapper to return null when required

edited Sep 28 '15 at 19:55

answered Sep 28 '15 at 16:08

Adam Martin

1,188
1
11
24

This does technically work, and I did consider it, but it requires two completely separate transformations and combines them together. – Paul Young Sep 28 '15 at 17:13
Actually this doesn't work; because the last record which should return `null` returns `12D`. I would have to wrap this in a case and check for string size. – Paul Young Sep 28 '15 at 17:25
I get `12D` on the last record. – Paul Young Sep 28 '15 at 19:04
Oops, sorry, was looking at the wrong column. Yeah, you would need to wrap it in a case statement or another regexp. – Adam Martin Sep 28 '15 at 19:51

score 0 · Answer 3 · answered Sep 28 '15 at 16:23

You can use the fact that the position parameter of REGEXP_REPLACE() can take back-references to get a lot closer. Wrapped in a CASE statement you get what you're after:

select case when length(regexp_replace(txt, '[^[:alnum:]]')) >= 8 then
            regexp_replace( regexp_replace(txt, '[^[:alnum:]]')
                          , '^([[:alnum:]]{3}).*([[:alnum:]]{5})$'
                          , '\1\2')
       end
  from test_data

This is, where the length of the string with all non-alpha-numeric characters replaced is greater or equal to 8 return the 1st and 2nd groups, which are respectively the first 3 and last 8 alpha-numeric characters.

This feels... overly complex. Once you've replaced all non-alpha-numeric characters you can just use an ordinary SUBSTR():

with test_data as (
select '1-23-456-78-90-ABCDE' txt from dual union all
select '1 23 456 78 90 ABCDE' txt from dual union all
select '1234567890ABCDE' txt from dual union all
select '123ABCDE' txt from dual union all
select '12DE' txt from dual
       )
, standardised as (
select regexp_replace(txt, '[^[:alnum:]]') as txt
  from test_data
       )
select case when length(txt) >= 8 then substr(txt, 1, 3) || substr(txt, -5) end
  from standardised

I considered this, not exactly, but a similar approach using a nested FROM, but I assumed that would be a tremendous amount of processing power to do two distinctly separate transformation operations. — Paul Young, Sep 28 '15 at 17:08
It's identical @Paul, you're doing them anyway. I've just done this for clarity. There's 3 functions here and only one of them is a regular expression (heavier). The minimum you're going to do this in is two regular expressions... unless you're working on tens of millions of records it's going to make no difference. TBH this'll be relatively quick on hundreds of millions... — Ben, Sep 28 '15 at 17:12

REGEXP_SUBSTR to return first and last segment

3 Answers3