I have a large text file with some notes in TextWrangler that I want to parse with Regex and write the matches to a CSV file for MySQL import. Here is a sample source:

ARCHIVE

02.09.2014 22:35 

title1
content
content
content
content

30.08.2014 18:13 

title2
content
    content with tab
    content with tab
content

...
more notes as above
...

Each note starts with a date surrounded by returns, then a title and some content lines. I'm currently testing with the following Regex in the TW Find dialog with Grep checked to get the date, title and content block for each note:

\r(\d\d\.\d\d\.\d\d\d\d \d\d:\d\d)\s*\r\r(.+)(?s)((?:(?!\r\d\d\.\d\d\.\d\d\d\d \d\d:\d\d\s*\r).)*)

This looks for the date surrounded by returns, then captures the title line, and finally captures all following lines up to (but not including) the next date block. That last part is a non-capturing group that consumes one character at a time, with a negative lookahead checking that no date block starts at that position. Just before it, DOTALL is switched on with (?s) so that the dot metacharacter also matches returns.
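To check the logic of the pattern independently of TextWrangler, here is a minimal sketch that runs the same expression through Python's `re` module. It rests on a few assumptions that are not in the original: `\n` stands in for `\r`, the sample text is shortened, and the names `SAMPLE`, `DATE` and `NOTE` are made up for illustration. Python's engine is not the PCRE build used by TextWrangler, so this only shows that the pattern itself copes with tab-indented lines, not why the editor fails.

```python
import re

# Shortened sample from the question; "\n" stands in for TextWrangler's "\r".
SAMPLE = (
    "ARCHIVE\n"
    "\n"
    "02.09.2014 22:35 \n"
    "\n"
    "title1\n"
    "content\n"
    "content\n"
    "\n"
    "30.08.2014 18:13 \n"
    "\n"
    "title2\n"
    "content\n"
    "\tcontent with tab\n"
    "content\n"
)

DATE = r"\d\d\.\d\d\.\d\d\d\d \d\d:\d\d"

# Recent Python versions reject a bare (?s) in the middle of a pattern,
# so the scoped form (?s:...) is used to enable DOTALL for the body only.
NOTE = re.compile(
    r"\n(" + DATE + r")\s*\n\n"                # date line plus one empty line
    r"(.+)"                                    # title line
    r"(?s:((?:(?!\n" + DATE + r"\s*\n).)*))"   # body, up to the next date block
)

notes = [m.groups() for m in NOTE.finditer(SAMPLE)]

for date, title, body in notes:
    print(repr(date), repr(title), repr(body))
```

Both notes are matched here, including the one with tab-indented lines, which suggests the pattern is logically sound and the failure lies elsewhere.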

With the sample source above the Find works for the first note but not for the second one, where some lines are indented with tabs. TW shows this error:

(screenshot of the error dialog, not reproduced here)

This is where I'm stuck. Can anyone give me a hint?

Timm
  • Have you tried replacing the spaces in your expression with `\s`? – Mic1780 Sep 17 '14 at 23:44
  • Replacing the 2 spaces with `\s` makes both notes fail. – Timm Sep 18 '14 at 00:51
  • There are some questions about BBEdit crashing on complex Regex patterns (see http://stackoverflow.com/questions/9952957/why-is-my-search-in-bbedit-causing-a-stack-overflow-error). Maybe you can think of a way to make it simpler? – Timm Sep 18 '14 at 00:54

2 Answers

I tested the pattern some more and found that the Regex/Grep failure is really not predictable.

It seems to be related to tabs in the source, but other text may trigger the bug as well. For example, I found that a 'note' section containing tabs that matched fine started to fail once a web URL was added to it.

I am using TextWrangler 3.5.3 on Mavericks 10.9.4, on a system that was upgraded from Snow Leopard. I have had many obscure issues on this system, including in Apple Mail and other apps, so I suspect the TW bug may be related to those Mavericks problems. The reason I'm using an older version of TW is that I don't like the sidebar on the left.

As I said in my comment, there are SO questions on Grep issues with BBEdit/TW, and these don't seem to arise from the PCRE Regex engine as such but rather from the BBEdit code. Of course SO can't help with that.
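Whatever the root cause turns out to be, one way to make the search simpler is to avoid the per-character lookahead altogether: match only the date lines and slice the text between them. The sketch below does this in Python rather than in the TW Find dialog, so it is a different technique from the Grep pattern in the question. The names `DATE_LINE` and `split_notes` are hypothetical, and the code assumes every note has the shape shown in the sample (date line, empty line, title, content).

```python
import re

# A whole date line such as "02.09.2014 22:35 " (trailing blanks allowed).
DATE_LINE = re.compile(r"^(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2})[ \t]*$", re.MULTILINE)


def split_notes(text):
    """Return (date, title, body) tuples by slicing the text between date lines."""
    # Normalize classic Mac and Windows line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    hits = list(DATE_LINE.finditer(text))
    notes = []
    for i, hit in enumerate(hits):
        # A note runs from the end of its date line to the next date line.
        end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        block = text[hit.end():end].strip("\n")
        # The first remaining line is the title, the rest is the body.
        title, _, body = block.partition("\n")
        notes.append((hit.group(1), title, body))
    return notes
```

Because the only regex involved matches a single line, the amount of backtracking stays small no matter how long a note is.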

Timm

I'm not allowed to comment, so I have to 'answer'…

  • Your regex works nicely in TW 4.5.9 (on 10.9.5).
  • It works as well with the "non-capturing" bit dropped - in TW 4.5.9 (on 10.9.5). (At least for the matches, that is. Didn't verify its captures.)
  • In TW 4.5.9, web URLs do not seem to cause a problem either.

Possibly

\r(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2})\s*\r\r(.+\r)((.+\r)+)(?!\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}\s*\r)

does serve your purpose in TW 3.5.3 on 10.9.4 as well. In TW 4.5.9 its captures seem to be identical to yours, except for leading/trailing line breaks (which could, of course, be added if actually required).
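Since the stated goal is a CSV file for MySQL import, here is a minimal sketch of how the pattern above could feed a CSV writer in Python. Several things are assumptions for illustration: `\n` replaces `\r`, the inner group is made non-capturing so that exactly three groups come back, the column names are invented, and `notes_to_csv` is a hypothetical helper that returns the CSV as a string instead of writing a file.

```python
import csv
import io
import re

DATE = r"\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}"

# The pattern from this answer with "\r" written as "\n"; the inner group is
# non-capturing here so that only date, title and body are captured.
NOTE = re.compile(
    r"\n(" + DATE + r")\s*\n\n(.+\n)((?:.+\n)+)(?!" + DATE + r"\s*\n)"
)


def notes_to_csv(text):
    """Return CSV text with one row (date, title, body) per matched note."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "title", "body"])  # header names are illustrative
    for date, title, body in NOTE.findall(text):
        # Drop the trailing line break that the pattern captures with each part.
        writer.writerow([date, title.rstrip("\n"), body.rstrip("\n")])
    return out.getvalue()
```

Multi-line bodies end up as quoted CSV fields with embedded line breaks. Note that `(.+\n)+` stops at the first empty line, so a note whose content contains blank lines would be cut short; that fits the sample, but it is an assumption about the real file.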

Abecee