How to recognize variables that don't affect the output of a program?

Question

Sometimes the value of a variable accessed within the control flow of a program cannot possibly have any effect on its output. For example:

global var_1
global var_2

start program hello(var_3, var_4)
    if (var_2 < 0) then
        save-log-to-disk (var_1, var_3, var_4)
    end-if
    return ("Hello " + var_3 + ", my name is " + var_1)
end program

Here only var_1 and var_3 have any influence on the output, while var_2 and var_4 are only used for side effects. Do variables such as var_1 and var_3 have a name in dataflow-theory/compiler-theory? Which static dataflow analysis techniques can be used to discover them?

References to academic literature on the subject would be particularly appreciated.

Assuming the compiler can distinguish these two classes of variables, what can it do with that information? I don't think you can argue in general that the call to `save-log-to-disk` is less important than the function result. — 500 - Internal Server Error, Aug 02 '15 at 00:08
@Internal Server Error: When considering a program, you often consider only certain inputs and outputs as interesting from a functionality point of view. The program may compute/do other things, but you don't care. Log files fit this category. — Ira Baxter, Jan 27 '16 at 04:55

score 1 · Answer 1 · answered Jan 27 '16 at 00:13

The problem that you stated is undecidable in general, even for the following very narrow special case: Given a single routine P(x), where x is a parameter of type integer. Is the output of P(x) independent of the value of x, i.e., does P(0) = P(1) = P(2) = ...?

We can reduce the following still undecidable version of the halting problem to the question above: Given a Turing machine M(), does the program never stop on the empty input?

I assume that we use a (Turing-complete) language in which we can build a "Turing machine simulator":

Given the program M(), construct this routine:

P(x):
    if x == 0:
        return 0
    run M() for x steps
    if M() has terminated:
        return 1
    else:
        return 0

Now:

M() does not terminate
=> P(x) = 0 for every x
=> P(0) = P(1) = P(2) = ...

M() does terminate
=> P(x) = 1 for a sufficiently large x
=> P(x) != P(0) = 0

So the output of P is independent of x exactly when M() never terminates.
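The construction of P can be sketched in Python. This is only an illustration, under the assumption that M is modelled as a generator function in which each `yield` counts as one step and returning means halting; `make_p`, `halts_after_three` and `loops_forever` are hypothetical names, not part of the argument above.

```python
from typing import Callable, Iterator


def make_p(machine: Callable[[], Iterator[None]]) -> Callable[[int], int]:
    """Build the routine P from a step-wise program M (a generator function)."""
    def p(x: int) -> int:
        if x == 0:
            return 0
        run = machine()
        for _ in range(x):          # advance M by at most x steps
            try:
                next(run)
            except StopIteration:   # M halted within the step budget
                return 1
        return 0                    # M was still running after x steps
    return p


def halts_after_three() -> Iterator[None]:
    """A stand-in for an M that halts (after three steps)."""
    for _ in range(3):
        yield


def loops_forever() -> Iterator[None]:
    """A stand-in for an M that never halts."""
    while True:
        yield


p_halt = make_p(halts_after_three)
p_loop = make_p(loops_forever)
print([p_halt(x) for x in range(6)])  # depends on x once the budget is large enough
print([p_loop(x) for x in range(6)])  # constant for every x
```

Deciding whether `p` ignores `x` for an arbitrary `machine` would amount to deciding whether that machine ever halts.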

So, in general, a compiler cannot decide exactly whether a variable influences the return value of a routine. In your example, the "side effect routine" might manipulate one of the variables it is given (or even loop infinitely, which would most definitely change the return value of the routine ;-). Of course, overapproximations are still possible. For example, one might conclude that a variable does not influence the return value if it does not appear in the routine body at all. Some classical compiler analyses (such as expression simplification and constant propagation) also have the side effect of eliminating appearances of such redundant variables.
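The simplest over-approximation mentioned above (a parameter that never appears in the routine body cannot influence the result) can be sketched with Python's standard `ast` module. This is a minimal sketch that assumes the source contains a single function definition with plain positional parameters; `unreferenced_params` is a hypothetical helper name.

```python
import ast


def unreferenced_params(source: str) -> set[str]:
    """Return the parameters that are never mentioned in the function body."""
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("expected a single function definition")
    params = {a.arg for a in func.args.args}
    # Parameter declarations are ast.arg nodes, so only real uses show up as ast.Name.
    mentioned = {n.id for n in ast.walk(func) if isinstance(n, ast.Name)}
    return params - mentioned


example = (
    "def hello(var_3, var_4):\n"
    "    return 'Hello ' + var_3\n"
)
print(unreferenced_params(example))
```

Note that this check is purely syntactic: a parameter that is mentioned but only feeds a side effect (like var_4 in the question) would not be reported.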

score 1 · Answer 2 · answered Jan 27 '16 at 04:54

Pachelbel has discussed the fact that you cannot do this perfectly. OK, I'm an engineer, I'm willing to accept some dirt in my answer.

The classic way to answer your question is to do dataflow tracing from program outputs back to program inputs. A dataflow is the connection from a program assignment (or side effect) that gives a variable its value to a place in the application that consumes that value.

If there is (transitive) dataflow from an input you supplied (such as var_2) to a program output that you care about (in your example, the printed text stream), then that input "affects" the output. A variable that does not flow into your desired output is useless from your point of view.

If you focus your attention only on the computations involved in those dataflows, and display them, you get what is generally called a "program slice". There are (very few) commercial tools that can show this to you. GrammaTech has a good reputation here for C and C++.

There are standard compiler algorithms for constructing such dataflow graphs; see any competent compiler book.
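As a rough illustration of tracing backwards from an output, here is a minimal sketch for straight-line code only. It assumes each statement is reduced to "target is computed from these variables" and it ignores control flow, aliasing and calls; `Assign` and `relevant_inputs` are hypothetical names, and the statement list is a hand-written model of the question's example, not the output of a real front end.

```python
from typing import NamedTuple


class Assign(NamedTuple):
    target: str            # variable being defined
    uses: frozenset[str]   # variables read on the right-hand side


def relevant_inputs(body: list[Assign], outputs: set[str]) -> set[str]:
    """Walk the statements backwards, following def-use links from the outputs."""
    relevant = set(outputs)
    for stmt in reversed(body):
        if stmt.target in relevant:
            relevant.discard(stmt.target)  # this definition satisfies the demand
            relevant |= stmt.uses          # ...and creates demand for its operands
    return relevant


# Hand-written model of the hello() routine from the question.
body = [
    Assign("log_line", frozenset({"var_1", "var_3", "var_4"})),  # only feeds the log
    Assign("greeting", frozenset({"var_3"})),
    Assign("result", frozenset({"greeting", "var_1"})),
]
print(relevant_inputs(body, {"result"}))
```

The statements that were visited while their target was relevant form the slice for `result`; everything else (here the `log_line` computation) lies outside it.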

They all suffer from some limitation due to Turing's impossibility proofs as pointed out by Pachelbel. When you implement such a dataflow algorithm, there will be places that it cannot know the right answer; simply pick one.

If your algorithm chooses to answer "there is no dataflow" in places where it is not sure, then it may miss a valid dataflow and incorrectly report that a variable does not affect the answer. (This is called a "false negative".) This occasional error may be acceptable if the algorithm has some other nice properties, e.g., it runs really fast on millions of lines of code. (The trivial algorithm simply says "no dataflow" in all places, and it is really fast :)

If your algorithm chooses to answer "yes there is a dataflow", then it may claim that some variable affects the answer when it does not. (This is called a "false positive").
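The two choices can be made concrete with a small sketch. It assumes a toy straight-line program in which some right-hand sides are opaque to the analysis (marked with `None`), and a hypothetical flag `assume_flow` that selects which way to err; none of these names come from a real tool.

```python
from typing import Optional

# (target, uses); uses is None when the analysis cannot see what is read.
Stmt = tuple[str, Optional[set[str]]]


def relevant_vars(body: list[Stmt], outputs: set[str],
                  all_inputs: set[str], assume_flow: bool) -> set[str]:
    """Backward def-use walk that must guess at statements it cannot analyse."""
    relevant = set(outputs)
    for target, uses in reversed(body):
        if target not in relevant:
            continue
        relevant.discard(target)
        if uses is not None:
            relevant |= uses
        elif assume_flow:
            relevant |= all_inputs  # "yes, there is a dataflow": may over-report
        # else: "there is no dataflow": may miss a real dependency
    return relevant


body = [("t", None), ("out", {"t", "a"})]  # t comes from an opaque call
inputs = {"a", "b", "c"}
print(relevant_vars(body, {"out"}, inputs, assume_flow=True))
print(relevant_vars(body, {"out"}, inputs, assume_flow=False))
```

With `assume_flow=True` every input is reported as possibly affecting `out`; with `assume_flow=False` only `a` is reported, which is wrong if the opaque call actually reads `b` or `c`.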

You get to decide which is more important; many people prefer false positives when looking for a problem, because then you have to at least look at possibilities detected by the tool. A false negative means it didn't report something you might care about. YMMV.

Here's a starting reference: http://en.wikipedia.org/wiki/Data-flow_analysis Any of the books on that page will be pretty good. I have Muchnick's book and like it a lot. See also this page: (http://en.wikipedia.org/wiki/Program_slicing)

You will discover that implementing this is a pretty big effort for any real language. You are probably better off finding a tool framework that does most or all of this for you already.

score 0 · Answer 3 · answered Jan 28 '16 at 19:34

I use the following algorithm: a variable is used if it is a parameter or it occurs anywhere in an expression, excluding as the LHS of an assignment. First, count the number of uses of all variables. Delete unused variables and assignments to unused variables. Repeat until no variables are deleted.

This algorithm only implements a subset of the OP's requirement, and it is horribly inefficient because it requires multiple passes. A garbage collection may be faster but is harder to write: my algorithm only requires a list of variables with usage counts. Each pass is linear in the size of the program. The algorithm effectively does a limited kind of dataflow analysis by eliminating the tail of a flow that ends in an assignment.

For my language, the elimination of side effects in the RHS of an assignment to an unused variable is mandated by the language specification; it may not be suitable for other languages. Effectiveness is improved by running the algorithm before inlining, to reduce the cost of inlining unused function applications, and then running it again afterwards, which eliminates parameters of inlined functions.
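A minimal sketch of this usage-count scheme, assuming the program is reduced to a list of `(target, uses)` assignments plus the variables read by the final result; `eliminate_unused` and this data layout are illustrative assumptions, not the actual compiler's code.

```python
from collections import Counter


def eliminate_unused(assignments: list[tuple[str, list[str]]],
                     params: list[str],
                     result_uses: list[str]) -> list[tuple[str, list[str]]]:
    """Repeatedly delete assignments whose target is never read."""
    live = list(assignments)
    while True:
        counts = Counter(result_uses)   # reads in the result expression
        counts.update(params)           # parameters always count as used
        for _, uses in live:
            counts.update(uses)         # reads on right-hand sides only
        kept = [(t, u) for t, u in live if counts[t] > 0]
        if len(kept) == len(live):      # fixed point: nothing was deleted
            return kept
        live = kept                     # one linear pass done; go again


program = [
    ("a", ["x"]),
    ("b", ["a"]),
    ("c", ["b"]),        # c is never read, so b and then a become dead too
    ("r", ["x", "y"]),
]
print(eliminate_unused(program, params=["x", "y"], result_uses=["r"]))
```

The chain a, b, c is removed one link per pass, which shows why several passes may be needed.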

Just as an example of the utility of the language specification, the library constructs a thread pool and assigns a pointer to it to a global variable. If the thread pool is not used, the assignment is deleted, and hence the construction of the thread pool elided.

IMHO compiler optimisations are almost invariably heuristics whose performance matters more than their effectiveness at achieving a theoretical goal (like removing unused variables). Simple reductions are useful not only because they're fast and easy to write, but because a programmer who understands the basics of how the compiler operates can leverage this knowledge to help the compiler. The most well-known example of this is probably the refactoring of recursive functions to place the recursion in tail position: a pointless exercise unless the programmer knows the compiler can do tail-recursion optimisation.
