(SQL) Identify positions of multiple occurrences of a string format within a field

Question

I need to split a narrative field (free text) into multiple rows. Format is currently along the lines of:

Case_Reference | Narrative
```````````````|`````````````````````````````````````
XXXX/XX-123456 | [Endless_Text up to ~50k characters]

Within the narrative field as text, individual entries (when various agents have done something to the case) begin with the entry date followed by two spaces (i.e. 'dd/mm/yyyy '), with the values of the dates changing with each entry within that same field.

In other words, after trawling for a better delimiter, the only one I can use is this format of string, so I need to identify multiple positions within the Narrative text where the format (would mask be a better word?) matches 'dd/mm/yyyy '.

I can identify multiple occurrences of a consistent string no problem, but it's identifying it where I'm essentially looking for:

'%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %'

PATINDEX of course returns the first occurrence/position of this, but so far as I'm aware, there's no way to "modify" this (i.e. a created function) to allow for picking up the rest of the occurrences/positions of this they way we can with CHARINDEX (since PATINDEX doesn't have a starting position parameter).

For clarity, I'm not looking for code to delimit this directly as I need to further manipulate each entry, so it's purely the positions of multiple occurrences of the string within the Narrative text I'm looking for.

Any help would be very much appreciated.

For clarity, there's no option to do this pre-import, so it needs to be done on this landed data.

Desired output would be

 Case_Reference1 | 1st_Position_of_Delimiter_String  
 Case_Reference1 | 2nd_Position_of_Delimiter_String  
 Case_Reference2 | 1st_Position_of_Delimiter_String  
 Case_Reference2 | 2nd_Position_of_Delimiter_String  
 Case_Reference2 | 3rd_Position_of_Delimiter_String

What is the desired output? You have currently told us the route you want to take but not the finished product. There may be a better way than the route you are currently engaged in. — iamdave, Nov 22 '16 at 10:11
Best way to do this is during data ingestion using programming language, If you want to do the same for the existing data then you might end up with PATIndex only... — Kannan Kandasamy, Nov 22 '16 at 10:12
@Kannan Thank you, unfortunately this is not an option due to business constraints. — Lorielus, Nov 22 '16 at 10:46
@iamdave - thank you for formatting the post, much appreciated. The output desired is simply the case reference with each position of the string against it (post edited to demonstrate). — Lorielus, Nov 22 '16 at 10:48

Shnugo · Accepted Answer · 2016-11-22T11:22:19.147

You might solve this with an recursive CTE

DECLARE @tbl TABLE (Case_Reference NVARCHAR(MAX),Narrative NVARCHAR(MAX));
INSERT INTO @tbl VALUES
 (N'C1',N'01/02/2000  Some text with     blanks 02/03/2000  More text 03/04/2000  An even more')
,(N'C2',N'01/02/2000  Test for C2 02/03/2000  One more for C2 03/04/2000  An even more 04/05/2000  Blah')
,(N'C3',N'01/02/2000  Test for C3 02/03/2000  One more for C3 03/04/2000  An even more')
 ;

WITH recCTE AS
(
    SELECT 1 AS Step,Case_Reference,Narrative,CAST(1 AS BIGINT) AS StartsAt,NewPos.EndsAt+10 AS EndsAt,LEN(Narrative) AS MaxLen
          ,SUBSTRING(Narrative,NewPos.EndsAt+10+1,999999) AS RestString
    FROM @tbl AS tbl
    CROSS APPLY(SELECT PATINDEX('%[0-3][0-9]/[0-1][0-9]/[1-2][0-9][0-9][0-9]  %',SUBSTRING(Narrative,12,9999999))) AS NewPos(EndsAt)

    UNION ALL

    SELECT r.Step+1,r.Case_Reference,r.Narrative,r.EndsAt+1,CASE WHEN NewPos.EndsAt>0 THEN r.EndsAt+NewPos.EndsAt+10 ELSE r.MaxLen END,r.MaxLen
          ,SUBSTRING(r.RestString,NewPos.EndsAt+10+1,999999) 
    FROM recCTE AS r
    CROSS APPLY(SELECT PATINDEX('%[0-3][0-9]/[0-1][0-9]/[1-2][0-9][0-9][0-9]  %',SUBSTRING(r.RestString,12,99999999))) AS NewPos(EndsAt)
    WHERE r.EndsAt<r.MaxLen
)
SELECT Step,Case_Reference,StartsAt,EndsAt
      ,SUBSTRING(Narrative,StartsAt,EndsAt-StartsAt+1) AS OutputString 
FROM recCTE

ORDER BY Case_Reference,Step

The result

+------+----------------+----------+--------+---------------------------------------+
| Step | Case_Reference | StartsAt | EndsAt | OutputString                          |
+------+----------------+----------+--------+---------------------------------------+
| 1    | C1             | 1        | 38     | 01/02/2000  Some text with     blanks |
+------+----------------+----------+--------+---------------------------------------+
| 2    | C1             | 39       | 60     | 02/03/2000  More text                 |
+------+----------------+----------+--------+---------------------------------------+
| 3    | C1             | 61       | 84     | 03/04/2000  An even more              |
+------+----------------+----------+--------+---------------------------------------+
| 1    | C2             | 1        | 24     | 01/02/2000  Test for C2               |
+------+----------------+----------+--------+---------------------------------------+
| 2    | C2             | 25       | 52     | 02/03/2000  One more for C2           |
+------+----------------+----------+--------+---------------------------------------+
| 3    | C2             | 53       | 77     | 03/04/2000  An even more              |
+------+----------------+----------+--------+---------------------------------------+
| 4    | C2             | 78       | 93     | 04/05/2000  Blah                      |
+------+----------------+----------+--------+---------------------------------------+
| 1    | C3             | 1        | 24     | 01/02/2000  Test for C3               |
+------+----------------+----------+--------+---------------------------------------+
| 2    | C3             | 25       | 52     | 02/03/2000  One more for C3           |
+------+----------------+----------+--------+---------------------------------------+
| 3    | C3             | 53       | 76     | 03/04/2000  An even more              |
+------+----------------+----------+--------+---------------------------------------+

Lorielus note Shnugo's refinement of the pattern also, should help avoid false matches — James Casey, Nov 22 '16 at 11:41
Thank you very much Shnugo & James - if I could send a beer through the internet I would! — Lorielus, Nov 22 '16 at 15:49

score 2 · Answer 2 · answered Nov 22 '16 at 11:20

Try this recursive cte

declare @t table 
(
    caseref varchar(20),
    narrative varchar(max)
)

insert into @t values('Case_Reference1', 'blah 10/11/2016  something 13/11/2016  something else');
insert into @t values('Case_Reference2', '11/11/2016  something 12/11/2016  something else 14/11/2016  something yet still');
insert into @t values('Case_Reference3', 'should find nothing');

with cte (caseref, pos, remainingstr) as 
(
    select caseref, 
        patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative),
        substring(narrative, patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative) + 12, len(narrative) - 12 - patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative))
    from @t
    where patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative) > 0

    union all

    select caseref,
        pos + 12 + patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr),
        substring(remainingstr, patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr) + 12, len(remainingstr) - 12 - patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr))
    from cte
    where patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr) > 0

)
select caseref, pos
from cte
order by caseref, pos

Great, you followed quite the same approach as I did, +1 from my side. One hint: I appreciate the possibilty to put repeating expressions (like the `patindex` in `CROSS APPLY(SELECT ...)` and use them with speaking names. — Shnugo, Nov 22 '16 at 11:28

(SQL) Identify positions of multiple occurrences of a string format within a field

2 Answers2

Linked