I'm stumped by a "stack overflow" error, "out of stack space (application error code: 12246)", that I get in BBEdit when I do a "Replace All", searching for

(@article(((?!eprint|@article|@book).)*\r)*)pmid = {(.+)}((((?!eprint|@article|@book).)*\r)*(@|\r*\z))

and replacing with

\1eprinttype = {pubmed}, eprint = {\4}\5

I can use these same patterns manually, doing one-at-a-time find & replace, without any errors, even after no matches remain. I can also avoid the error by working on smaller files.

I suspect that it's my inefficient and sloppy regex coding that's to blame, and would appreciate an expert's help in doing this more efficiently. I'm trying to locate all entries in a BibLaTeX bibliography that don't already have an eprint field, but which have a pmid field, and replace the pmid field with a corresponding e-print specification (using eprint and eprinttype).
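For reference, the intended transformation can be sketched outside BBEdit. This is a minimal Python sketch, not a BBEdit pattern, and it swaps the single large regex for a per-entry pass. It assumes `\n` line endings (BBEdit's grep uses `\r`), entries that begin with `@` at the start of a line, and a `pmid = {...}` field on a single line. The function names are made up for illustration.

```python
import re

# Assumption: every entry starts with "@type" at the beginning of a line.
ENTRY_START = re.compile(r"(?m)^@\w+")
# Assumption: the pmid value sits on one line and contains no closing brace.
PMID_FIELD = re.compile(r"pmid = \{([^}\n]+)\}")
# An eprint field at the start of a line (does not match "eprinttype").
EPRINT_FIELD = re.compile(r"(?m)^\s*eprint\s*=")


def split_entries(text: str) -> list[str]:
    """Cut the file into chunks, each beginning at an entry start."""
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first entry
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def convert_entry(entry: str) -> str:
    """Rewrite pmid as eprinttype/eprint in @article entries lacking eprint."""
    if not entry.startswith("@article") or EPRINT_FIELD.search(entry):
        return entry
    return PMID_FIELD.sub(r"eprinttype = {pubmed}, eprint = {\1}", entry, count=1)


def pmid_to_eprint(text: str) -> str:
    """Apply convert_entry to every entry and reassemble the file."""
    return "".join(convert_entry(e) for e in split_entries(text))
```

Because each entry is handled separately with small patterns, no single match has to span the whole file.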


Update: After some experimentation, I've found that a different approach is the only thing I can get to work. Searching for

(?(?=@article(.+\r)+eprint = {(.+\r)+}\r*)(?!)|(@article(.+\r)+)pmid = {(.+)}((.+\r)+}\r*))

and replacing with

\3eprinttype = {pubmed}, eprint = {\5}\6

does the trick. The only problem is that the numbered backreferences are fragile, and I can't get named backreferences to work in BBEdit.

orome
  • Exactly how many times does the string appear and how large is the file you're iterating through? –  Mar 31 '12 at 04:24
  • @RandolphWest: The error occurs regardless of how many times a match appears in the file. I can run it on a file in which no matches occur and get the same error. All that seems to matter is file size: about 6,000 lines does the trick. In a typical file there are probably fewer than 100 entries that match. – orome Mar 31 '12 at 04:29
  • That's fascinating. While I'd be the first to blame RegEx, you may be running into a bug. Maybe it's worth dropping a line to Bare Bones, even if you do resolve this another way. –  Mar 31 '12 at 04:31
  • Try modifying the expression, see if you can find some change that causes it to SO/not SO. This seems like a bug in the editor to me. – Qtax Mar 31 '12 at 11:51
  • If you want to fail, you can simply use `(?!)`. – Qtax Apr 01 '12 at 09:57

2 Answers

It's probably catastrophic backtracking caused by this last part:

.)*\r)*(@|\r*\z))

If you break that down and simplify it, you essentially have a .*, a \r*, and another \r* right next to each other. Now picture a string of \r characters at the end of your input: how should each \r be distributed, and which of those clauses soaks up which character? If you have \r\r\r\r\r, you could eat all five \rs with the .* part and none at all with the \r* parts, or you could split them in any of the many other ways that still match. Since * is greedy, the engine fills the .* first, but if the overall match then fails, it has to backtrack through the other splits until one works or all are exhausted. So it's probably burning through resources on unnecessary backtracking until it finally crashes.
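To put rough numbers on that, here is a small Python sketch that only counts candidate splits; it does not run a regex engine, and a real engine stops at the first split that succeeds and may prune many of the others. Both function names are made up for illustration, and the nested form `(X+)*` is a simplified stand-in for the nested quantifiers in the pattern, not the pattern itself.

```python
from functools import lru_cache
from itertools import product


def splits_across_three_stars(n: int) -> int:
    """Ways to hand n identical characters to three adjacent X* clauses.

    The first two clauses take a and b characters; the third takes the rest.
    """
    return sum(1 for a, b in product(range(n + 1), repeat=2) if a + b <= n)


@lru_cache(maxsize=None)
def splits_under_nested_star(n: int) -> int:
    """Ordered ways to cut a run of n characters into non-empty chunks,
    which is what the outer * of a nested form like (X+)* can try."""
    if n == 0:
        return 1
    return sum(splits_under_nested_star(n - k) for k in range(1, n + 1))


for n in (5, 10, 20):
    print(n, splits_across_three_stars(n), splits_under_nested_star(n))
```

For five characters the two counts are 21 and 16. The adjacent-star count grows quadratically with the run length, while the nested count doubles with every extra character.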

I'm not an expert on optimization techniques for regex, but I'd start there if I were you.

Update:

Check out the Wikipedia article on PCRE:

Unless the "NoRecurse" PCRE build option (aka "--disable-stack-for-recursion") is chosen, adequate stack space must be allocated to PCRE by the calling application or operating system. ... While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids entirely the issue of stack overflows.

So I think catastrophic backtracking is a good bet here. I'd try to solve it by tweaking your regex before changing the build options on PCRE.
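As a rough illustration of how a recursion-based matcher can run out of stack on long input, here is a toy Python sketch. It is not PCRE's code: `match_lines`, `STACK_BUDGET`, and `StackExhausted` are made-up names, and the budget of 200 frames is an arbitrary stand-in for a real stack limit.

```python
class StackExhausted(Exception):
    """Raised when the toy matcher exceeds its pretend stack budget."""


STACK_BUDGET = 200  # hypothetical frame limit, far smaller than any real stack


def match_lines(text: str, pos: int = 0, depth: int = 0) -> int:
    """Toy matcher for a 'repeat whole lines' group, one nested call per line.

    Returns the offset where the repetition stops matching.
    """
    if depth > STACK_BUDGET:
        raise StackExhausted(f"gave up after {depth} nested calls")
    newline = text.find("\n", pos)
    if newline == -1:
        return pos  # no complete line left, so the repetition ends here
    # Each consumed line adds one frame; nothing is popped until the end.
    return match_lines(text, newline + 1, depth + 1)
```

In this toy model, 50 short lines match without trouble, while 6,000 lines raise `StackExhausted` whether or not the rest of the pattern would have matched. That would be consistent with the error depending on file size rather than on the number of matches.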

Justin Morgan - On strike
  • Backtracking shouldn't cause SO, just takes forever to run. Why would more stack frames be allocated when backtracking? Doesn't make sense. Trying different permutations doesn't need more stack, just needs more time. Although I have no idea how PCRE implements this, and I could be wrong. :P – Qtax Mar 31 '12 at 11:48
  • @Qtax - It would definitely depend on the implementation. But the link I posted specifically mentions stack overflow in pre-5.10 Perl. Since PCRE more or less aims at being a C port of Perl... – Justin Morgan - On strike Mar 31 '12 at 14:14
  • @Qtax - See my update; it looks like stack overflow is an issue in PCRE when it's optimized for speed. – Justin Morgan - On strike Mar 31 '12 at 14:22
  • @JustinMorgan: I've updated my question with a solution that works, using a different approach. – orome Mar 31 '12 at 18:12

Obviously this is some bug. But you could try changing the expression a bit. It's difficult to optimize the expression without knowing the requirements, but here's a guess:

(@article(?:(?:(?!eprint|@article|@book|pmid)[^\r])*+\r)*+)pmid = {([^\n\r]+)}((?:(?:(?!eprint|@article|@book)[^\r])*+\r)*(?:@|\r*\z))

Replace with:

\1eprinttype = {pubmed}, eprint = {\2}\3

BBEdit seems to use PCRE; unless its version is (very) outdated, the above expression should be compatible.

Qtax