Regexp_Substr to filter rows in SQL

Question

For the below table structure and data, I am trying to get rows that contain at least 1 date within the last 15 months for a column having string groups separated by a pattern.

CREATE TABLE TEMP(nbr NUMBER, dt VARCHAR2(4000));
INSERT INTO TEMP VALUES(1,'22/05/2016');
INSERT INTO TEMP VALUES(2,'22/05/2020##22/01/2020##22/10/2019');
INSERT INTO TEMP VALUES(3,'25/05/2020##22/07/2019##22/11/2019');
INSERT INTO TEMP VALUES(4,'25/05/2015##22/01/2017##22/06/2018');

The query that I have written is :

select nbr,dt from temp
WHERE MONTHS_BETWEEN(SYSDATE,TO_DATE(REGEXP_SUBSTR(dt,'[^##]+',1,LEVEL)))<15
CONNECT BY LEVEL<=REGEXP_COUNT(dt,'[^##]+');

For rows 2 and 3,the value of attribute dt is such that at least 1 of the substrings have a date less than 15 months old. Expected result is :

nbr    dt

2   22/05/2020##22/01/2020##22/10/2019

3   25/05/2020##22/07/2019##22/11/2019

The result that I am getting is:https://i.stack.imgur.com/rZ1tp.jpg

I am sure that this query is missing something because of which I am getting all those duplicate rows. This is the first time that I am attempting something of this kind. Can someone point out to me what is that I need to do to correct this?

Fix your data model! Don't store dates as strings! Don't store multiple values in a string column! — Gordon Linoff, Dec 21 '20 at 18:16
@GordonLinoff,Well I have obviously presented a relevant condensed view of the bigger table. The data came to table from an Excel spreadsheet, wherein data was stored in this manner. These multiple values resembled the dates on which item (identified by nbr column) was purchased . — halfwind22, Dec 21 '20 at 18:21

score 2 · Answer 1 · answered Dec 22 '20 at 01:18

You can simply use an EXISTS filter with a correlated hierarchical subquery:

SELECT *
FROM   temp t
WHERE  EXISTS (
         SELECT 1
         FROM   DUAL
         WHERE  TO_DATE( REGEXP_SUBSTR( t.dt, '[^#]+', 1, LEVEL ), 'DD/MM/YYYY' )
                  >= ADD_MONTHS( TRUNC(SYSDATE), -15 )
         CONNECT BY
                LEVEL <= REGEXP_COUNT( t.dt, '[^#]+' )
       )

Which, for your sample data, outputs:

NBR | DT                                
--: | :---------------------------------
  2 | 22/05/2020##22/01/2020##22/10/2019
  3 | 25/05/2020##22/07/2019##22/11/2019

db<>fiddle here

score 1 · Answer 2 · answered Dec 21 '20 at 19:45

I would be remiss if I did not mention that a proper data model is called for, and the data should be converted into that data model as it is brought in from the spreadsheet. This way you would not pull your hair out trying to jump through hoops to get what you are after like you are doing now. Sure you can get want you want, but with code that will be tough for someone else to maintain behind you. As an exercise I would encourage you to do just that even though you may get a solution that works for your current situation. See for yourself how taking the time to create a properly normalized data model and moving the spreadsheet data into that will enable you to create more efficient and easier to maintain code.

Ok, That said, consider this. The WITH clause here is used like a temp table to create a data set to select from. The query uses CONNECT BY to move through each element of the string, where an element is followed by the delimiter of '##' or the end of the line (using this form of regex handles NULL list elements). Notice I included LEVEL in the select list so you can see which element matched the condition. If you don't care about that, remove it and add a DISTINCT to the select statement to get the output you expect in your original post.

WITH tbl(item_nbr, purchase_date) AS (
  SELECT 1, '22/05/2016' FROM dual UNION ALL
  SELECT 2, '22/05/2020##22/01/2020##22/10/2019' FROM dual UNION ALL
  SELECT 3, '25/05/2020##22/07/2019##22/11/2019' FROM dual UNION ALL
  SELECT 4, '25/05/2015##22/01/2017##22/06/2018' FROM dual
)
SELECT --DISTINCT
       item_nbr, LEVEL AS ELEMENT, purchase_date
FROM tbl
WHERE MONTHS_BETWEEN(SYSDATE, TO_DATE(REGEXP_SUBSTR(purchase_date, '(.*?)(##|$)', 1, LEVEL, NULL, 1), 'DD/MM/YYYY' ) ) < 15
CONNECT BY LEVEL <= REGEXP_COUNT(purchase_date,'##') + 1
  AND PRIOR item_nbr = item_nbr
  AND PRIOR SYS_GUID() IS NOT NULL;


  ITEM_NBR    ELEMENT PURCHASE_DATE                     
---------- ---------- ----------------------------------
         2          1 22/05/2020##22/01/2020##22/10/2019
         2          2 22/05/2020##22/01/2020##22/10/2019
         2          3 22/05/2020##22/01/2020##22/10/2019
         3          1 25/05/2020##22/07/2019##22/11/2019
         3          3 25/05/2020##22/07/2019##22/11/2019

5 rows selected.

score 0 · Answer 3 · answered Dec 21 '20 at 23:46

I like to use standard recursive queries for this. That's a portable syntax, that works across may databases with slight variations, so an useful thing to learn. Another upside is that we can use simple string functions rather than a regular expression.

We can use the following recursive query to parse the string elements to a series of rows holding the corresponding date:

with cte (nbr, dt, dt_new, dt_rest) as (
    select nbr, dt, null, dt || '##' from temp
    union all
    select nbr, dt,
        to_date(
            substr(dt_rest, 1, instr(dt_rest, '##') - 1)
            default null on conversion error,
            'dd/mm/yyyy'
        ),
        substr(dt_rest, instr(dt_rest, '##') + 2) 
    from cte
    where instr(dt_rest, '##') > 0
)
select * from cte 
where dt_new is not null 
order by nbr, dt_new;

From there on, we can simply aggregate and look at the latest date per group:

with cte (nbr, dt, dt_new, dt_rest ) as (...)
select nbr, dt, max(dt_new) as max_dt_new
from cte
group by nbr, dt
having months_between(sysdate, max(dt_new)) < 15

Demo on DB Fiddle:

NBR | DT                                 | MAX_DT_NEW
--: | :--------------------------------- | :---------
  2 | 22/05/2020##22/01/2020##22/10/2019 | 22-MAY-20 
  3 | 25/05/2020##22/07/2019##22/11/2019 | 25-MAY-20

Regexp_Substr to filter rows in SQL

3 Answers3