
This code matches a string against an NFA and, I think, requires O(N^2) memory. Compiled without optimization, it predictably breaks when the string size is 20,000; it then works when compiled with -O2, and breaks again with -O3. Compilation was done with -std=c++14. In my opinion, the problem is a stack overflow.

The input string was "ab" repeated 10,000 times, plus a 'c' at the end. The image below shows the NFA I'm trying to match.

Specifically, my questions are:

1) Which -O2 optimization is behind this fix (which I find impressive)?

2) And which -O3 optimization breaks it again?

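#include <map>
#include <string>
#include <vector>
using namespace std;

struct State
{
    map<char,vector<State*> > transitions; // '|' marks an e-transition
    bool accepting = false;
};

// Recursive backtracking search; inp is passed by value on every call.
bool match(State* state,string inp){
    if(inp=="") return state->accepting;

    for(auto s:state->transitions[inp[0]])
        if(match(s,inp.substr(1))) return true;

    for(auto s:state->transitions['|']) //e-transitions
        if(match(s,inp)) return true;

    return false;
}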

The gcc documentation says that -O3 enables all the optimizations of -O2, plus some more. I couldn't understand some of those extras or their relevance to this problem. And I want to emphasize, given what I've seen in similar questions, that I'm not looking for specific ways to fix this problem.

The tested NFA

Shihab Shahriar Khan

1 Answer


As you have already figured out, the problem is the stack usage of your recursion. It is also true that tail-call optimization (TCO) is performed neither for -O2 nor for -O3 (in theory it would be possible only for the last recursive call, which would not help in your case).

However, depending on the optimization level, your function needs a different amount of stack space per call. There is no guarantee that the -O3 version will be faster or need less stack space than the -O2 version.

When we look at the assembly, we can see the following:

  1. -O3 reserves 88 bytes via subq $88, %rsp. The actual stack footprint is even larger, because registers r12-r15 are also pushed onto the stack in addition to the usual function prologue.

  2. -O2 reserves only 56 bytes, in addition to the registers pushed onto the stack.

  3. Without optimization, the stack footprint is the largest: every value has to be stored to and reloaded from the stack between two lines of the original code, so that debugging behaves predictably and values can be changed in the debugger.

That would explain your observations: without optimization, the stack fills up quickly. -O2 mitigates the problem (but doesn't fix it), so a recursion depth of 20,000 can be handled; it will probably crash for 30,000. The -O3 version has a larger stack footprint per call and already fails for smaller inputs.
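To get a feel for how the reserved bytes translate into a recursion limit, here is a rough back-of-the-envelope sketch. The 88 and 56 bytes come from the assembly above; everything else is an assumption for illustration: an 8 MiB stack, 8-byte stack slots, and six pushed registers per call (rbp, rbx, r12-r15). The function name max_frames is hypothetical.

```cpp
#include <cstddef>

// Rough upper bound on the number of nested calls that fit on the stack.
// Assumes x86-64, so the return address and each pushed register take 8 bytes.
constexpr std::size_t max_frames(std::size_t stack_bytes,
                                 std::size_t reserved_bytes,
                                 std::size_t pushed_registers) {
    const std::size_t frame_bytes = reserved_bytes + 8 * (pushed_registers + 1);
    return stack_bytes / frame_bytes;
}

// Assumption: 8 MiB stack; check the real limit on your system.
constexpr std::size_t kAssumedStack = 8u * 1024u * 1024u;

// Assumption: six pushed registers in both builds.
constexpr std::size_t kFramesO3 = max_frames(kAssumedStack, 88, 6);
constexpr std::size_t kFramesO2 = max_frames(kAssumedStack, 56, 6);
```

Note that this counts frames, not input characters: one character can cost more than one frame when e-transitions are followed, so the absolute numbers will not match the observed limits. The useful part is the ratio: under these assumptions the -O2 build fits noticeably more frames than the -O3 build.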

The proper fix is now obvious: use either an iterative version of depth-first search or a breadth-first search.
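As a minimal sketch of the breadth-first variant (not the only way to do it), the function below tracks the set of states the NFA can be in after each character, so its stack usage does not grow with the input length. The names match_iterative and add_epsilon_closure are hypothetical; the State struct and the '|' marker for e-transitions are taken from the question.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

struct State {
    std::map<char, std::vector<State*>> transitions;
    bool accepting = false;
};

// Expand `states` in place with everything reachable through '|' edges,
// using an explicit worklist instead of recursion.
void add_epsilon_closure(std::set<State*>& states) {
    std::vector<State*> work(states.begin(), states.end());
    while (!work.empty()) {
        State* s = work.back();
        work.pop_back();
        auto it = s->transitions.find('|');
        if (it == s->transitions.end()) continue;
        for (State* next : it->second)
            if (states.insert(next).second) work.push_back(next);
    }
}

// Simulate the NFA one input character at a time.
bool match_iterative(State* start, const std::string& inp) {
    std::set<State*> current{start};
    add_epsilon_closure(current);
    for (char c : inp) {
        std::set<State*> next;
        for (State* s : current) {
            auto it = s->transitions.find(c);
            if (it != s->transitions.end())
                next.insert(it->second.begin(), it->second.end());
        }
        add_epsilon_closure(next);
        if (next.empty()) return false;  // no live states left
        current.swap(next);
    }
    for (State* s : current)
        if (s->accepting) return true;
    return false;
}
```

Because each state appears at most once per step, the work per character is bounded by the size of the NFA rather than by the number of paths through it.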

Another issue in your code is the use of substr, which results in unnecessary memory copying and usage. Just pass an iterator to the first unread character of the string and increment it for the recursive call.
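A minimal sketch of that change, using an index instead of an iterator for brevity (the idea is the same). The name match_at and the pos parameter are hypothetical; the search order follows the question's match().

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct State {
    std::map<char, std::vector<State*>> transitions;
    bool accepting = false;
};

// The input is shared by const reference; `pos` is the index of the
// next unread character, so no substring is ever copied.
bool match_at(State* state, const std::string& inp, std::size_t pos = 0) {
    if (pos == inp.size()) return state->accepting;

    // Transitions that consume inp[pos].
    auto it = state->transitions.find(inp[pos]);
    if (it != state->transitions.end())
        for (State* s : it->second)
            if (match_at(s, inp, pos + 1)) return true;

    // e-transitions: move to another state without consuming input.
    auto eps = state->transitions.find('|');
    if (eps != state->transitions.end())
        for (State* s : eps->second)
            if (match_at(s, inp, pos)) return true;

    return false;
}
```

This removes the copying, but the recursion depth still grows with the input length, so on its own it does not prevent the stack overflow.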

ead
  • Should've thought of that. -O3 works for strings up to 19k, -O2 breaks at 22k on my PC. The gap is small. – Shihab Shahriar Khan Nov 03 '17 at 18:29
  • I have 3 questions. How hard is x86 assembly? Where did you learn it from? And most importantly, for someone who won't necessarily use this knowledge in the foreseeable future, is learning x86 worth it? I know the answer would be complicated, but plz try to answer the last one in boolean. – Shihab Shahriar Khan Nov 03 '17 at 18:32
  • @ShihabShahriar my understanding of x86 is very limited, so I'm not qualified to answer your first two questions. As for the third: assembly makes you a better programmer because it enables you to understand how things really work - you don't have to guess what the compiler did - you can just see it – ead Nov 03 '17 at 19:04