SQL server 2008 patindex recursion

Question

I want to find the latest instance of an expression, then keep looking to find there a better match and then choose the best match.

The cell I am looking at is a repeatedly apended log with notes followed by the username and timestamp.

Example cell contents:

Starting the investigation.
JWAYNE entered the notes above on 08/12/1976 12:01

Taking over the case. Not a lot of progress recently.
CEASTWOOD entered the notes above on 03/14/2001 09:04

No wonder this case is not progressing, the whole town is covering up some shenanigans!
CEASTWOOD entered the notes above on 03/21/2001 05:23

Star command was right, this investigation has been tossed around like a hot potato for a long time!
BLIGHTYEAR entered the notes above on 08/29/2659 08:01

I am not an expert on database normal form rules but it is annoying that the entries are jammed together into one cell making my job of isolating and checking the notes for specific words, especially when the cell is duplicated for multiple rows until the investigation is closed which puts the notes from future phases into the note column of past events and on top of that the time stamps making a timestamp PATINDEX with even a few minute margin unreliable like this:

CaseID, Username,  Notes,             Phase, Timestamp
E18902, JWAYNE,    Starting....08:01, E1,    03/14/2001 09:13
E18902, CEASTWOOD, Starting....08:01, E2,    03/14/2001 09:13
E18902, CEASTWOOD, Starting....08:01, E3,    03/21/2001 05:34
E18902, BLIGHTYEAR,Starting....08:01, E4,    08/29/2659 07:58

Right now I am doing a reverse on the whole string then a patindex to find the username then substringing to select only the note for that phase of the investigation and the problem is when the same user enters notes for multiple phases my simple "look for the first match staring at the end of the string moving to the top" picks up the wrong entry. My first thought is to search for the username and then check again to see if an entry further up is a better match (note time stamp vs column time stamp) but I am not sure how to code that...

Do i have to get into complicated string splits or is there a more simple solution?

This sounds like it could be much, much better handled in a procedural language that does string manipulation, than in a query. What is the ultimate destination for your query? Can it involve a high-level language like C#? — Ann L., Sep 25 '16 at 15:33
A better match would going up the string further and finding another note by the same user but the Timestamp column time is closer to the timestamp listed at the end of the note entry. Basically the problem i have now is that my current solution is giving up at the first instance of the username it detects, I want to be able to compare detection and evaluate each and pick the best one based on how close the timestamps are. End use is an SSRS report, is there some options for a VB/C solution there? — jbird468, Sep 29 '16 at 01:07
Do the cell contents include line feeds/carriage returns, as you show? — Ann L., Sep 29 '16 at 20:13
unfortunately yes, most are one or more char(10)s, some have a char(13) with the char(10)s. I found this with a nested REPLACE for each char 10 and 13. — jbird468, Oct 06 '16 at 08:52
Are you still looking for a solution? I think I have one to suggest. — Ann L., Oct 22 '16 at 17:14

Ann L. · Accepted Answer · 2016-11-14T14:52:47.003

Here's my suggestion. This is for one record, but you can convert it to a user-defined table-valued function, if you like.

I'm going to use the example data you had above.

 declare @sourceText nvarchar(max)
    ,    @workText   nvarchar(max)
    ,    @xml        xml

 set @sourceText = <your example text in your question>
 set @workText = @sourceText

 -- We're going to replace all the carriage returns and line feeds with 
 -- characters unlikely to appear in your text.  (If they are, use some
 -- other character.)

 set    @workText = REPLACE(@workText, char(10), '|')
 set    @workText = REPLACE(@workText, char(13), '|')

 -- Now, we're going to turn your text into XML.  Our first target is 
 -- the string of four "|" characters that the blank lines between entries
 -- will be turned into.  (If you've got 3, or 6, or blanks in between, 
 -- adjust accordingly.)

set @workText = REPLACE(@workText, '||||', '</line></entry><entry><line>')

-- Now we replace every other "|".  
set @workText = REPLACE(@workText, '|', '</line><line>')

-- Now we construct the rest of the XML and convert the variable to an 
-- actual XML variable.
set @workText = '<entry><line>' + @workText + '</line></entry>'
set @workText = REPLACE(@workText, '<line></line>','') -- Get rid of any empty nodes.

set @xml = CONVERT(xml, @workText)

We should now have an XML fragment that looks like this. (You can see it if you insert select @xml into the SQL at this point.)

<entry>
  <line>Starting the investigation.</line>
  <line>JWAYNE entered the notes above on 08/12/1976 12:01</line>
</entry>
<entry>
  <line>Taking over the case. Not a lot of progress recently.</line>
  <line>CEASTWOOD entered the notes above on 03/14/2001 09:04</line>
</entry>
<entry>
  <line>No wonder this case is not progressing, the whole town is covering up some shenanigans!</line>
  <line>CEASTWOOD entered the notes above on 03/21/2001 05:23</line>
</entry>
<entry>
  <line>Star command was right, this investigation has been tossed around like a hot potato for a long time!</line>
  <line>BLIGHTYEAR entered the notes above on 08/29/2659 08:01</line>
</entry>

We can now transform this XML into XML we like better:

  set @xml = @xml.query(
  'for $entry in /entry
    return <entry><data>
    {
    for $line in $entry/line[position() < last()] 
    return string($line)
    }
    </data>
    <timestamp>{ data($entry/line[last()]) }</timestamp>     
 </entry>
 ')

This gives us XML that looks like this (just one entry shown, for length reasons):

<entry>
    <data>Starting the investigation.</data>
    <timestamp>JWAYNE entered the notes above on 08/12/1976 12:01</timestamp>
</entry>

You can convert this back to tabular data with this query:

select  EntryData = R.lines.value('data[1]', 'nvarchar(max)')
    ,   EntryTimestamp = R.lines.value('timestamp[1]', 'nvarchar(MAX)')
from    @xml.nodes('/entry') as R(lines)

... and get data that looks like this.

And from there, you can do whatever you need to do.

Will this work with multi line note entries? Most of the actual data is made of one or more lines of notes before the username footer. — jbird468, Nov 14 '16 at 03:57
Yes, it should. The `return string($line)` in the transform should merge all the `` elements other than the last one into one `` tag. — Ann L., Nov 14 '16 at 14:51
That works very well for the example data I submitted with the question! I am now running into cases where there are inconsistent numbers of newlines between the timestamps. — jbird468, Nov 18 '16 at 06:13
Yeah, I can see how that would be a problem! My solution does assume that there's a reliable, consistent way to tell that one entry has stopped and another one started. — Ann L., Nov 18 '16 at 17:52

SQL server 2008 patindex recursion

1 Answers1