Suppose we use this definition of tree equivalence:
T1 is equivalent to T2 iff
all paths to leaves of T1 exist exactly once in T2, and
all paths to leaves of T2 exist exactly once in T1
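
To make the definition concrete, here is a minimal sketch of a direct (unhashed) equivalence check. It reads the definition as "both trees have the same multiset of root-to-leaf paths", which is my assumption about how duplicate paths should be treated. The names `Tree`, `Path`, `collect`, `leaf_paths`, and `equivalent` are hypothetical and only serve this illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical minimal tree type for this illustration.
struct Tree {
    int content;
    std::vector<Tree> children;
};

using Path = std::vector<int>;

// Depth-first walk that records the contents along every root-to-leaf path.
void collect(const Tree& t, Path& prefix, std::vector<Path>& out) {
    prefix.push_back(t.content);
    if (t.children.empty()) out.push_back(prefix);
    for (const Tree& c : t.children) collect(c, prefix, out);
    prefix.pop_back();
}

// All root-to-leaf paths, sorted so that child order no longer matters.
std::vector<Path> leaf_paths(const Tree& t) {
    std::vector<Path> out;
    Path prefix;
    collect(t, prefix, out);
    std::sort(out.begin(), out.end());
    return out;
}

// Equivalent iff both trees contain the same paths with the same multiplicity.
bool equivalent(const Tree& a, const Tree& b) {
    return leaf_paths(a) == leaf_paths(b);
}
```

This check is exact but has to materialise and sort every path, which is the cost a hash comparison tries to avoid.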
Hashing a sequence (a path) is straightforward. If h_tree(T) is a hash of all paths-to-leaves of T, where the order of the paths does not alter the outcome, then it is a good hash for the whole of T, in the sense that trees that are equivalent under the above definition will produce equal hashes. So I propose:
h_path(path) = an order-dependent hash of all elements in the path.
Requires O(|path|) time to calculate,
but child nodes can reuse the calculation of their
parent node's h_path in their own calculations.
h_tree(T) = an order-independent hashing of all its paths-to-leaves.
Can be calculated in O(|L|) once the path hashes are known, where L is the set of leaves
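In C++:

    #include <vector>

    // PRIME1 and PRIME2 were left unspecified; these values are arbitrary primes
    const unsigned PRIME1 = 31, PRIME2 = 1000003;

    struct node {
        unsigned path_hash = 0; // root-to-node hash; only use for building tree_hash
        unsigned tree_hash = 0; // takes children into account; use to compare trees
        int content = 0;
        std::vector<node> children;

        // unsigned arithmetic wraps on overflow instead of being undefined
        unsigned update_hash(unsigned parent_path_hash = 1) {
            path_hash = parent_path_hash * PRIME1 + content; // order-dependent
            tree_hash = path_hash;
            for (node& n : children) { // by reference, so each child keeps its hashes
                tree_hash += n.update_hash(path_hash) * PRIME2; // order-independent
            }
            return tree_hash;
        }
    };
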
After building two trees, update their hashes and compare them. Equivalent trees should have the same hash; different trees will usually, but not always, have different ones. Note that the path and tree hashes I am using are rather simplistic, chosen for ease of programming rather than for collision resistance.
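
As a rough usage sketch of the "update and compare" step, the block below repeats the struct so that it compiles on its own. It assumes unsigned arithmetic, a loop that takes children by reference, and illustrative values for `PRIME1` and `PRIME2`. The helpers `make` and `same_hash` are hypothetical names introduced only for this example.

```cpp
#include <utility>
#include <vector>

// Illustrative constants; the values are an assumption, not from the proposal.
const unsigned PRIME1 = 31, PRIME2 = 1000003;

struct node {
    unsigned path_hash = 0;
    unsigned tree_hash = 0;
    int content = 0;
    std::vector<node> children;

    unsigned update_hash(unsigned parent_path_hash = 1) {
        path_hash = parent_path_hash * PRIME1 + content; // order-dependent
        tree_hash = path_hash;
        for (node& n : children) { // reference, so children keep their hashes
            tree_hash += n.update_hash(path_hash) * PRIME2; // order-independent
        }
        return tree_hash;
    }
};

// Hypothetical helper: builds a node with the given content and children.
node make(int content, std::vector<node> children = {}) {
    node n;
    n.content = content;
    n.children = std::move(children);
    return n;
}

// Hypothetical helper: recompute both tree hashes, then compare them.
bool same_hash(node& a, node& b) {
    return a.update_hash() == b.update_hash();
}
```

With these constants, `make(1, {make(2), make(3)})` and `make(1, {make(3), make(2)})` produce the same tree hash, while replacing the 3 with a 4 produces a different one. Equal hashes only suggest equivalence, so a full comparison is still needed if collisions matter.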