Improving performance for a TM simulator

Question

I am trying to simulate a lot of 2 state, 3 symbol (One direction tape) Turing machines. Each simulation will have different input, and will run for a fixed number of steps. The current bottleneck in the program seems to be the simulator, taking a ton of memory on Turing machines which do not halt.

The task is to simulate about 650000 TMs, each with about 200 non-blank inputs. The largest number of steps I am trying is 1 billion (10**9).

Below is the code I am running. vector<vector<int> > TM is a transition table.

vector<int> fast_simulate(vector<vector<int> > TM, string TM_input, int steps) {
    /* Return the state reached after supplied steps */

    vector<int> tape = itotape(TM_input);

    int head = 0;
    int current_state = 0;
    int halt_state = 2;

    for(int i = 0; i < steps; i++){

        // Read from tape
        if(head >= tape.size()) {
            tape.push_back(2);
        }
        int cell = tape[head];
        int data = TM[current_state][cell];  // get transition for this state/input

        int move = data % 2;
        int write = (data % 10) % 3;
        current_state = data / 10;

        if(current_state == halt_state) {
            // This highlights the last place that is written to in the tape
            tape[head] = 4;
            vector<int> res = shorten_tape(tape);
            res.push_back(i+1);
            return res;
        }

        // Write to tape
        tape[head] = write;

        // move head
        if(move == 0) {
            if(head != 0) {
                head--;
            }
        } else {
            head++;
        }
    }

    vector<int> res {-1};
    return res;
}

vector<int> itotape(string TM_input) {
    vector<int> tape;
    for(char &c : TM_input) {
        tape.push_back(c - '0');
    }
    return tape;
}

vector<int> shorten_tape(vector<int> tape) {
    /*  Shorten the tape by removing unnecessary 2's (blanks) from the end of it.
    */
    int i = tape.size()-1;
    for(; i >= 0; --i) {
        if(tape[i] != 2) {
            tape.resize(i+1);
            return tape;
        }
    }
    return tape;
}

Is there anywhere I can make improvements in terms of performance or memory usage? Even a 2% decrease would make a noticeable difference.

You're passing some potentially huge `vector`s by value, thus, actively copying them around. — ForceBru, Mar 28 '17 at 16:27
Also, you're constantly `push_back`ing in `itotape`, thus altering the size of `tape` loads of times per second, which is quite expensive. Note that `TM_input` is a string, whose size you know, so you could allocate enough memory at once. — ForceBru, Mar 28 '17 at 16:33
this question may be better suited for http://codereview.stackexchange.com/ (provided that it is working code). I am not sure if it classifies as too broad or opinion based, but for sure there is little chance that someone else will have the same question — 463035818_is_not_an_ai, Mar 28 '17 at 16:36
@ForceBru good point, I can pass the tape by reference. I don't know the final size of the tape, so is there any point in reserving memory for the length of the string if the tape is going to grow immediately afterwards? — spyr03, Mar 28 '17 at 16:42
@spyr03, `TM_input` is a `std::string`, and it does already 'know' its size. I'm trying to say that you already know that size in `itotape`, so you can allocate enough memory for `tape` at once and then populate it with data. Currently you're constantly increasing the size of `tape` by successively allocating memory, which wastes a lot of time. — ForceBru, Mar 28 '17 at 16:45
What is the average number of steps per one run of simulation? — stgatilov, Mar 28 '17 at 17:19
@stgatilov the simulations have two common results, halting in less than 100 steps and running for the full 1 billion steps. I don't know which is more common, but the full 1 billion steps would eat up most of the time. — spyr03, Mar 28 '17 at 18:25
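A minimal sketch of ForceBru's two suggestions from the comments above (pass by reference, allocate once), applied to the question's helper functions. The `extra_cells` parameter is hypothetical and not part of the original code; it only illustrates reserving headroom beyond the input length. Note one behavioral difference: this `shorten_tape` returns an empty vector for an all-blank tape, whereas the original returns the tape unchanged in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build the tape with a single allocation: the input length is known up front.
// extra_cells (hypothetical) reserves headroom so early growth does not reallocate.
std::vector<int> itotape(const std::string& TM_input, std::size_t extra_cells = 0) {
    std::vector<int> tape;
    tape.reserve(TM_input.size() + extra_cells);
    for (char c : TM_input) tape.push_back(c - '0');
    return tape;
}

// Read the tape through a const reference and copy only the part that is kept.
std::vector<int> shorten_tape(const std::vector<int>& tape) {
    std::size_t n = tape.size();
    while (n > 0 && tape[n - 1] == 2) --n;  // drop trailing blanks (symbol 2)
    return std::vector<int>(tape.begin(), tape.begin() + n);
}
```

The same `const&` change would apply to the `TM` and `TM_input` parameters of `fast_simulate`, which are only read during the simulation.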

stgatilov · Accepted Answer · 2017-03-29T03:07:56.133

1

Make sure no allocations happen during the whole TM simulation.

Preallocate a single global array at program startup, big enough for any state of the tape (e.g. 10^8 elements). Initially, put the machine at the beginning of this tape array. Maintain the segment [0; R] of all the cells that were visited by the current machine simulation: this lets you avoid clearing the whole tape array when you start a new simulation.

Use the smallest integer type for tape elements which is enough (e.g. use unsigned char if the alphabet surely has less than 256 characters). Perhaps you can even switch to bitsets if alphabet is very small. This reduces memory footprint and improves cache/RAM performance.
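The two points above (one preallocated tape, small element type) could look roughly like the sketch below. The names `g_tape`, `g_used`, `reset_tape` and `mark_visited` are made up for illustration, and the capacity `kTapeCells` is an assumed placeholder that would have to be sized for the real step limit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

constexpr std::uint8_t kBlank = 2;            // blank symbol, as in the question's code
constexpr std::size_t kTapeCells = 1u << 20;  // assumed capacity, not taken from the question

// One tape buffer, allocated once and reused by every simulation.
static std::vector<std::uint8_t> g_tape(kTapeCells, kBlank);
static std::size_t g_used = 0;  // only cells [0, g_used) may differ from kBlank

// Prepare the tape for a new run: blank only what the previous run touched,
// then copy the new input in.
void reset_tape(const std::string& input) {
    assert(input.size() <= kTapeCells);
    std::memset(g_tape.data(), kBlank, g_used);
    for (std::size_t i = 0; i < input.size(); ++i)
        g_tape[i] = static_cast<std::uint8_t>(input[i] - '0');
    g_used = input.size();
}

// Call when the head lands on a cell so the next reset knows how far to clear.
inline void mark_visited(std::size_t head) {
    if (head >= g_used) g_used = head + 1;
}
```

With this layout the inner loop never calls `push_back`, and a run that halts after 100 steps costs only about 100 bytes of clearing before the next run.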

Avoid using generic integer divisions in the innermost loop (they are slow), use only divisions by powers-of-two (they turn into bit shifts). As the final optimization, you may try to remove all branches from the innermost loop (there are various clever techniques for this).
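As an illustration of the last paragraph, here is one possible way to pack a transition so that decoding needs only shifts and masks, plus a head update without an `if`. The bit layout and the names `pack`, `unpack` and `move_head` are assumptions for this sketch; the question's table uses a decimal encoding instead. Whether the compiler really emits branch-free code for `move_head` should be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical packed transition:
//   bit 0    = move (0: left, 1: right)
//   bits 1-2 = symbol to write
//   bits 3-4 = next state
constexpr std::uint8_t pack(unsigned next_state, unsigned write, unsigned move) {
    return static_cast<std::uint8_t>((next_state << 3) | (write << 1) | move);
}

struct Step { unsigned next_state, write, move; };

// Decode with shifts and masks only: no division, no modulo.
inline Step unpack(std::uint8_t t) {
    return Step{ static_cast<unsigned>(t >> 3), (t >> 1) & 3u, t & 1u };
}

// Move the head without an if statement: a right move adds 1,
// a left move subtracts 1 unless the head is already on cell 0.
inline std::size_t move_head(std::size_t head, unsigned move) {
    const std::size_t can_go_left = (head != 0);  // 1 when a left move is possible
    return head + move - ((move ^ 1u) & can_go_left);
}
```

The halt check stays a normal branch here, since it is taken at most once per simulation.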

edited Mar 29 '17 at 03:07

answered Mar 28 '17 at 16:47

stgatilov


Why is "zeroing" the array after each simulation not slower than allocating memory? Changing the primitive in the array sounds like a really good idea to save memory :) What should I google to find the techniques for avoiding branching? – spyr03 Mar 28 '17 at 17:03
An allocation is faster than zeroing, but allocations don't zero. So, if you allocate every pass, you must do both (assuming you also want to zero out the allocated memory) – Donnie Mar 28 '17 at 17:04
Note that my answer suggests to NEVER reallocate anything, and zero only the really USED part of the tape. This is the minimal effort necessary to initialize the new tape. You can also memcpy your input string first, and then zero only the remaining part of the tape (used by the previous TM simulation). – stgatilov Mar 28 '17 at 17:06
Avoid `push_back` like the plague during simulation. If the vector has to grow, but there's not room, then the whole thing gets copied. This goes hand-in-hand with "no allocations", but I wanted to explain why it's potentially horrible. (Especially on a huge vector) – Donnie Mar 28 '17 at 17:11
For avoiding branches you need: [branchless min/max](https://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax). Note that the branch for the check `state == halt` should be retained as is, since it is well predictable (it happens only once per simulation). The branch for `move == 0` should be removed. – stgatilov Mar 28 '17 at 17:13
I can't remember if intel has a branchless "less than" opcode, but you can do min/max without the < check in the expression. https://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/ – Donnie Mar 28 '17 at 19:45

stgatilov · Answer 2 · 2017-03-29T03:46:06.870

Here is another answer with more algorithmic approaches.

Simulation by blocks

Since you have a tiny alphabet and a tiny number of states, you can accelerate the simulation by processing chunks of the tape at once. This is related to the well-known speedup theorem, although I suggest a slightly different method.

Divide the tape into blocks of 8 characters each. Each block can be represented by a 16-bit number (2 bits per character). Now suppose the machine is located at either the first or the last character of a block. Its subsequent behavior then depends only on its current state and the contents of the block, until the TM moves out of the block (to the left or to the right). We can precompute the outcome for every (block value, state, end) combination, or compute the outcomes lazily during simulation.

This method can simulate about 8 steps at once, although if you are unlucky it can do only one step per iteration (moving back and forth around block boundary). Here is the code sample:

//R = table[s][e][V] --- outcome for TM which:
//  starts in state s
//  runs on a tape block with initial contents V
//  starts on the (e = 0: leftmost, e = 1: rightmost) char of the block
//The value R is a bitmask encoding:
//  0..15 bits: the new value of the block
//  16..17 bits: the new state
//  18 bit: TM moved to the (0: left, 1: right) of the block
//  ??encode number of steps taken??
uint32_t table[2][2][1<<16];

//contents of the tape (grouped in 8-character blocks)
uint16_t tape[...];

int pos = 0;    //index of current block
int end = 0;    //TM is currently located at (0: start, 1: end) of the block
int state = 0;  //current state
while (state != 2) {
  //take the outcome of simulation on the current block
  uint32_t res = table[state][end][tape[pos]];
  //decode it into parts
  uint16_t newValue = res & 0xFFFFU;
  int newState = (res >> 16) & 3U;
  int move = (res >> 18);
  //write new contents to the tape
  tape[pos] = newValue;
  //switch to the new state
  state = newState;
  //move to the neighboring block
  pos += (2*move-1);
  end = !move;
  //avoid getting out of tape on the left:
  //the TM stays on the first character of block 0
  if (pos < 0)
      pos = 0, end = 0;
}
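The table itself could be filled by running the ordinary single-step simulation inside one block. The sketch below computes a single entry using the bit layout from the comments in the code sample above. The `Rule` struct is a hypothetical rule format, and bit 19 (`kStuckBit`) is an extra flag assumed here for the case where the TM never leaves the block; it is not part of the original encoding. As in the question's code, the halting transition stops without writing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rule format: symbol to write, move (0: left, 1: right), next state.
struct Rule { std::uint8_t write, move, next; };

constexpr int kBlockCells = 8;
constexpr int kHalt = 2;
constexpr std::uint32_t kStuckBit = 1u << 19;  // assumed flag: TM loops inside the block

// Simulate one block until the head leaves it, the TM halts, or a loop is certain.
// Cell i of the block lives in bits 2i..2i+1 of value (cell 0 is the leftmost).
std::uint32_t block_outcome(const Rule (&rules)[2][3], int state, int end, std::uint16_t value) {
    int head = end ? kBlockCells - 1 : 0;
    std::uint32_t block = value;
    // 2 states * 8 head positions * 3^8 block contents: if the TM is still inside
    // after this many steps, some configuration has repeated.
    const int max_steps = 2 * 8 * 6561;
    for (int step = 0; step < max_steps; ++step) {
        const unsigned symbol = (block >> (2 * head)) & 3u;
        assert(symbol < 3);  // a 2-bit cell can hold 3, which is not a valid symbol
        const Rule r = rules[state][symbol];
        state = r.next;
        if (state == kHalt) return block | (std::uint32_t(kHalt) << 16);
        block = (block & ~(3u << (2 * head))) | (std::uint32_t(r.write) << (2 * head));
        head += r.move ? 1 : -1;
        if (head < 0)            return block | (std::uint32_t(state) << 16);
        if (head >= kBlockCells) return block | (std::uint32_t(state) << 16) | (1u << 18);
    }
    return block | (std::uint32_t(state) << 16) | kStuckBit;
}
```

A main loop that uses this flag would have to decode the direction as `(res >> 18) & 1` and treat a set bit 19 as a proof that the machine never halts.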

Halting problem

The comment says that TM simulation is expected either to finish very early, or to run all the steps up to the predefined huge limit. Since you are going to simulate many Turing machines, it might be worth investing some time in solving the halting problem.

The first type of hanging that can be detected is when the machine stays in one region without moving far away from it. During simulation, maintain the surrounding of the TM: the values of the characters at distance < 16 from the TM's current location (31 characters). With 3 symbols, each character needs 2 bits, so the surrounding can be encoded in a 62-bit number.

Maintain a hash table for each position of the TM (as we'll see later, only 31 tables are necessary). After each step, store the tuple (state, surrounding) in the hash table of the current position. Now the important part: after each move, clear all hash tables at distance >= 16 from the TM (actually, only one such hash table has to be cleared). Before each step, check whether (state, surrounding) is already present in the hash table. If it is, the machine is in an infinite loop.

You can also detect another type of hanging: when the machine moves to the right infinitely and never returns. To achieve that, you can use the same hash tables. If the TM is located at the currently last character of the tape, with index p, check the current tuple (state, surrounding) not only in the p-th hash table but also in the (p-1)-th, (p-2)-th, ..., (p-15)-th hash tables. If you find a match, the TM is in an infinite loop moving to the right.
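A rough sketch of the first check (the machine looping in one region) is given below. It is only one way to realize the idea: the names `config_key` and `LoopDetector` are invented, cells to the left of the tape start are encoded with the unused 2-bit code 3, and the tables are kept in a ring of 32 slots rather than 31 because that makes the indexing simpler. The second check (running right forever) is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

constexpr int kRadius = 16;  // surrounding = cells at distance < 16 from the head

// Pack the state and the 31 cells around the head into one 64-bit key (2 bits per cell).
// Cells left of the tape start get the unused code 3; cells past the end count as blank (2).
std::uint64_t config_key(const std::vector<std::uint8_t>& tape, std::size_t head, unsigned state) {
    std::uint64_t key = state;  // ends up in bits 62..63 after the 31 shifts below
    for (int d = -(kRadius - 1); d <= kRadius - 1; ++d) {
        const long long i = static_cast<long long>(head) + d;
        std::uint64_t cell = 2;
        if (i < 0) cell = 3;
        else if (static_cast<std::size_t>(i) < tape.size()) cell = tape[static_cast<std::size_t>(i)];
        key = (key << 2) | cell;
    }
    return key;
}

struct LoopDetector {
    // Ring of 32 tables; the table for position p lives in slot p % 32.
    std::unordered_set<std::uint64_t> seen[2 * kRadius];

    // Call before every step. Returns true if this (state, surrounding) was already
    // recorded at this position since the head last moved 16 or more cells away from it.
    bool check_and_record(const std::vector<std::uint8_t>& tape, std::size_t head, unsigned state) {
        // Positions head+16 and head-16 share one slot; anything stored there is stale.
        seen[(head + kRadius) % (2 * kRadius)].clear();
        return !seen[head % (2 * kRadius)].insert(config_key(tape, head, state)).second;
    }
};
```

Since a hash-set insertion on every step is far more expensive than a plain simulation step, a check like this probably fits best in a bounded preliminary pass rather than in the full run.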

Really good answer. I had considered simulating multiple steps at once, but I had no idea how to actually go about doing it. Thanks for the link too. Regarding looping TMs, I do a preliminary check, simulating the TM for ~10000 steps and using a set to check whether it reaches the same internal state (head, current_state, tape) more than once. This did not catch the TMs that run right forever, so this will be a massive improvement. — spyr03, Mar 29 '17 at 10:30

Jason Lang · Answer 3 · 2017-03-28T17:14:01.100

0

Change

int move = data % 2;

To

int move = data & 1;

One is a divide, the other is a bitmask; both should give 0 or 1 based on the low bit. You can do this any time you have % by a power of two.
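A small check of when the two forms agree, with helper names made up for the example. For unsigned operands `x % 2` and `x & 1` always give the same result. For signed operands they differ on negative odd values, because since C++11 integer division truncates toward zero and the remainder takes the sign of the dividend. The values in the question's transition table are non-negative, so the replacement is safe there.

```cpp
#include <cassert>

// Low bit via modulo and via mask, unsigned operands: always equal.
constexpr unsigned low_bit_mod(unsigned x)  { return x % 2; }
constexpr unsigned low_bit_mask(unsigned x) { return x & 1u; }

// Signed operands: the results differ for negative odd values.
constexpr int signed_mod(int x)  { return x % 2; }
constexpr int signed_mask(int x) { return x & 1; }

static_assert(low_bit_mod(7u) == low_bit_mask(7u), "unsigned: both give 1");
static_assert(signed_mod(-1) == -1, "remainder keeps the sign of the dividend");
static_assert(signed_mask(-1) == 1, "mask gives the low bit (two's complement)");
```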

You're also setting

cell = tape[head];
data = TM[current_state][cell]; 
int move = data % 2;
int write = (data % 10) % 3;
current_state = data / 10;

Every single step, regardless of whether tape[head] has changed and even on branches where you're not accessing those values at all. Take a careful look at which branches use which data, and only update things just as they're needed. See straight after that you write:

    if(current_state == halt_state) {
        // This highlights the last place that is written to in the tape
        tape[head] = 4;
        vector<int> res = shorten_tape(tape);
        res.push_back(i+1);
        return res;
    }

^ This code doesn't reference "move" or "write", so you can put the calculation for "move"/"write" after it and only calculate them if current_state != halt_state

Also the true-branch of an if statement is the optimized branch. By checking for not the halt state, and putting the halt condition in the else branch you can improve the CPU branch prediction a little.

edited Mar 28 '17 at 17:14

answered Mar 28 '17 at 17:01

Jason Lang


I could be wrong, but won't all compilers these days automatically rewrite the mod as an AND? That seems like the sort of thing that isn't likely to contribute much, since I doubt that's the bottleneck. – templatetypedef Mar 29 '17 at 02:59
the compiler won't make this optimization for you. That's because % and & have different behavior for negative numbers. % leaves the sign on, so %2 of -1 is -1, whereas & only returns the selected bits. If you need to know odd/even and you don't want to worry about negatives, %2 is not safe, and the compiler won't make this substitution on the off-chance that you pass a negative in there (which would mean it gave different results). – Jason Lang Mar 29 '17 at 03:27
btw i tested a loop of 2 billion %2 vs 2 billion &1. Visual C++ 2015, full speed optimizations. The result was a saving of 20 seconds. 55 seconds for the loop of %2 and 33 seconds for the loop of &1. So if you run 1 billion steps and each step does a %2 you can save 10 seconds per run, just from replacing the %2 with &1. – Jason Lang Mar 29 '17 at 03:29
Huh, I was under the impression that in C++ the behavior of mod on negative values was implementation-defined and didn't have to round down that way. I'm honestly quite surprised! – templatetypedef Mar 29 '17 at 03:47
Well it might vary by implementation, so that's even more reason to prefer &1 for checking odd/evenness. I really like the Mike Acton videos about optimization, they get you thinking differently. The fact is many of the optimizations we do depend on domain-specific knowledge or assumptions. So a compiler cannot actually do them. People put too much faith in compiler magic rather than good data design. – Jason Lang Mar 29 '17 at 03:59
btw i misread a decimal point it wasn't a ratio of 33 to 55 it was a ratio of 3.3 to 55. So 1 billion &1 + additions takes about 1.5 seconds, but 20 seconds when you use %2. Knock about 20 seconds off for a 1-billion step run. – Jason Lang Mar 29 '17 at 04:22
That is really interesting! So if I used an unsigned numerical type, would the compiler be able to (consistently) replace %2 with &1? – spyr03 Mar 29 '17 at 10:32
