0

I have a set of strings on which I am performing some operations to retrieve only the part of the string that I want for further operations.

I have used various substr, instr, regexp_replace etc and achieved for some of the strings in a huge list. I initially found that the string is separated using '-'(underscore) characters and the part of string I want starts after third instance of the delimeter.

But, when I researched further the format became clumsy and the operations I did are not giving the expected output for all the type of strings in the columns.

Please find the attached Oracle(11g) SQL code below and help me in optimizing it to retrieve the expected output every time regardless of the string format.

SELECT 'ABCD-012345-EFG-10vXRI47HU-1' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10vXRI47HU-1',INSTR('ABCD-012345-EFG-10vXRI47HU-1','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10vXRI47HU' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSD0U/2' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSD0U/2',INSTR('ABCD-012345-EFG-10zSD0U/2','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSD0U' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10ZsE8h -1' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10ZsE8h -1',INSTR('ABCD-012345-EFG-10ZsE8h -1','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10ZsE8h' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG- 10zSe9K ' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG- 10zSe9K ',INSTR('ABCD-012345-EFG- 10zSe9K ','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-.10zSe9K' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-.10zSe9K',INSTR('ABCD-012345-EFG-.10zSe9K','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K_2' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K_2',INSTR('ABCD-012345-EFG-10zSe9K_2','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K.' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K.',INSTR('ABCD-012345-EFG-10zSe9K.','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345--EFG-10zSe9K' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345--EFG-10zSe9K',INSTR('ABCD-012345--EFG-10zSe9K','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K-' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K-',INSTR('ABCD-012345-EFG-10zSe9K-','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K-//1/23h' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K-//1/23h',INSTR('ABCD-012345-EFG-10zSe9K-//1/23h','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-EFGH-HIJK-012345-10zSe9K' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-EFGH-HIJK-012345-10zSe9K',INSTR('ABCD-EFGH-HIJK-012345-10zSe9K','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL

The output of the above query looks as below:

MAIN_STRING                     FUNCTION_OUTPUT   EXPECTED_OUTPUT
ABCD-012345-EFG-10vXRI47HU-1    10vXRI47HU1       10vXRI47HU
ABCD-012345-EFG-10zSD0U/2       10zSD0U2          10zSD0U   
ABCD-012345-EFG-10ZsE8h -1      10ZsE8h1          10ZsE8h   
ABCD-012345-EFG- 10zSe9K        10zSe9K           10zSe9K   
ABCD-012345-EFG-.10zSe9K        10zSe9K           10zSe9K   
ABCD-012345-EFG-10zSe9K_2       10zSe9K2          10zSe9K   
ABCD-012345-EFG-10zSe9K.        10zSe9K           10zSe9K   
ABCD-012345--EFG-10zSe9K        EFG10zSe9K        10zSe9K   
ABCD-012345-EFG-10zSe9K-        10zSe9K           10zSe9K   
ABCD-012345-EFG-10zSe9K-//1/23h 10zSe9K123h       10zSe9K   
ABCD-EFGH-HIJK-012345-10zSe9K   01234510zSe9K     10zSe9K   

You can observe there are differences between function_output and expected_output in some cases.

Note: The expected output might not always start with -10. It can be any alpha numeric value.

The only rule for this assignment is to extract only the expected output from the strings. The string formats can get too clumsy. I am trying to find a way to get the expected output everytime irrespective of the string format and delimeters available.

I have asked if the expected output can be of only certain number of characters everytime but I still did not get the answer for that.

But based on the data I have seen, it can start with 7 alpha numeric characters and go upto 12 alpha numeric characters. For that reason, I could not use a range of characters in the substring function.

So, I could not limit the substr function to certain length.

I would like to remove the differences and make the function_output match with expected_output every time.

Any help to achieve the expected output every time is highly appreciated.

Thanks In Advance.

Please let me know if you need anymore details.

venkatcg0
  • 1
  • 2
  • The INSTR function allows you to specify the occurrence of a string within a string so you can use this to find the 3rd delimiter and the SUBSTR from that point until the end of the string – NickW Dec 11 '22 at 07:40
  • @NickW, I have done the same but in some instances, I have other characters to be removed which is causing the output to be different from the expected output. Please go through the attached SQL query and the output for better understanding of the data I have to process – venkatcg0 Dec 11 '22 at 07:48
  • Are the rules you need to follow that you want the substring from the 3rd instance of the delimiter to the end of the string? If they are then that’s what my comment tells you how to do. If they are not the rules then please update your question to clearly state what all the rules are that need to be followed - examples may help to illustrate those rules but you do need to actually state the rules first – NickW Dec 11 '22 at 08:48
  • `regexp_substr(main_string,'[^-_./ ]+', 1, 4)` works for all but the last example. What is the rule exactly? – William Robertson Dec 11 '22 at 15:29
  • Hi @williamrobertson, thank you for the help. The assignment is to extract the expected output from the given strings. There is no rule on how the source strings might look. I have dug up the strings and found the different combinations it might be available in. I have to make sure that the extracted output is of alphanumeric. The part of string that we need will not contain any special characters apart from alphanumeric characters. But it might contain special characters at the start and end . Also it can be attached with some other alphanumeric values after the closing special character. – venkatcg0 Dec 11 '22 at 18:06
  • If there is no rule then why do you expect '10zSe9K' for the last row? – William Robertson Dec 12 '22 at 19:38

2 Answers2

0

Not sure that there is no more rules than in the sample data and expected result. With data provided - this should work:

Select 
    MAIN_STRING,
    CASE 
        WHEN INSTR(E_4_A, '-', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '-', 1, 1) - 1 ) 
        WHEN INSTR(E_4_A, '/', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '/', 1, 1) - 1 )
        WHEN INSTR(E_4_A, '_', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '_', 1, 1) - 1 )
    ELSE E_4_A END "E_4"
From
    ( SELECT 
        REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', '') "MAIN_STRING", 
        SubStr(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), InStr(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), '-10', 1, 1) + 1) "E_4_A" 
      FROM  strings  )
/* R e s u l t :
MAIN_STRING                     E_4                           
------------------------------- -------------------------------
ABCD-012345-EFG-10vXRI47HU-1    10vXRI47HU                      
ABCD-012345-EFG-10zSD0U/2       10zSD0U                         
ABCD-012345-EFG-10ZsE8h-1       10ZsE8h                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K_2       10zSe9K                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K-        10zSe9K                         
ABCD-012345-EFG-10zSe9K-//1/23h 10zSe9K                         
ABCD-EFGH-HIJK-012345-10zSe9K   10zSe9K                     
*/

.. the above code key to look for is '-10' and, when found, get ridd of what you don't want to keep.
... if you want to split it just by '-' then try it this way.

Select 
    MAIN_STRING,
    RTRIM(
            CASE 
                WHEN INSTR(E_4_A, '-', 1, 1) > 0 And Length(SubStr(E_4_A, INSTR(E_4_A, '-', 1, 1) + 1)) <= 5 THEN SubStr(E_4_A, 1, InStr(E_4_A, '-', 1, 1) - 1 ) 
                WHEN INSTR(E_4_A, '-', 1, 1) > 0 And Length(SubStr(E_4_A, INSTR(E_4_A, '-', 1, 1) + 1)) > 5 THEN SubStr(E_4_A, InStr(E_4_A, '-', 1, 1) + 1 ) 
                WHEN INSTR(E_4_A, '/', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '/', 1, 1) - 1 )
                WHEN INSTR(E_4_A, '_', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '_', 1, 1) - 1 )
            ELSE E_4_A 
            END, '-') "E_4"
From
    ( SELECT 
        REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', '') "MAIN_STRING", 
        SubStr(REPLACE(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), '//', ''), InStr(REPLACE(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), '//', ''), '-', 1, 3) + 1) "E_4_A" 
      FROM  strings  )
d r
  • 3,848
  • 2
  • 4
  • 15
  • Hi, thank you for the answer. The expected output might not always start with 10. Sorry for not mentioning this in the question. It can start with any alphanumeric value. Thank you for the effort you have put in for helping me. – venkatcg0 Dec 11 '22 at 18:13
  • @venkatcg0 I have a version with just '-' as separator. It is a bit more code but it works. Will post it in the morning if there were no other answer.... – d r Dec 11 '22 at 20:57
0

How about this:

with test_data (main_string, expected_output) as (
     select 'ABCD-012345-EFG-10vXRI47HU-1',   '10vXRI47HU'from dual union all
     select 'ABCD-012345-EFG-10zSD0U/2',       '10zSD0U'from dual union all
     select 'ABCD-012345-EFG-10ZsE8h -1',      '10ZsE8h'from dual union all
     select 'ABCD-012345-EFG- 10zSe9K ',       '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-.10zSe9K',        '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K_2',       '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K.',        '10zSe9K'from dual union all
     select 'ABCD-012345--EFG-10zSe9K',        '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K-',        '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K-//1/23h', '10zSe9K'from dual union all
     select 'ABCD-EFGH-HIJK-012345-10zSe9K',   '10zSe9K'from dual
)
select main_string
     , expected_output
     , regexp_replace
       ( regexp_replace(main_string,'^(([A-Z]+[ -.]+)|([0-9]+[ -.]+))+')
       , '(\W|_).*$' ) as my_output
from   test_data;

The inner regex_replace strips any runs of all-characters or all-numbers followed by spaces, hyphens or dots from the start of each string.

The outer regex_replace removes anything starting with a non-word character (\W) or underscore from the end. (You could also do it in one pass by constructing the pattern as three subexpressions and picking the second one, which might be fractionally more efficient for a large data set.)

DBFiddle
regex101 for inner replace
regex101 for outer replace example

William Robertson
  • 15,273
  • 4
  • 38
  • 44