How to calculate Hash value of a Tree

Question

What is the best way to calculate the hash value of a Tree?

I need to compare several trees for similarity in O(1), so I want to precalculate their hash values and compare those when needed. But then I realized that hashing a tree is different from hashing a sequence, and I wasn't able to come up with a good hash function.

What is the best way to calculate hash value of a tree?

Note: I will implement the function in C/C++.

Similar trees will not necessarily have similar hashes. If you want to check equality of trees, comparing hashes is fine, but most hashing solutions are not suitable for the calculation of similarities. — Markus Unterwaditzer, Aug 24 '13 at 17:03
Two trees T1 and T2 are equivalent if, for any node r1 (r1 belongs to T1) and r2 (r2 belongs to T2), when we consider r1 the root of T1 and r2 the root of T2, the rest of the trees can be rearranged in such a way that T1 and T2 are isomorphic to each other. — Bidhan Roy, Aug 26 '13 at 08:29
@BidhanRoy so, in other words, order of siblings is not important, but all paths to leaves in T1 exist exactly once in T2, and likewise for paths to leaves of T2 in T1. — tucuxi, Oct 22 '19 at 10:55

score 3 · Answer 1 · answered Oct 22 '19 at 10:38

Hashing a tree means representing it in a unique way, so that we can distinguish it from other trees using a simple representation or number. In an ordinary polynomial hash we use number base conversion: we convert a string or sequence to a specific prime base and reduce it modulo a value that is also a large prime. The same technique can be used to hash a tree.

Now fix the root of the tree at any vertex. Let root = 1 and,

B = The base in which we want to convert.

P[i] = The ith power of B (B^i).

level[i] = Depth of the ith vertex (its distance from the root).

child[i] = Total number of vertices in the subtree of the ith vertex, including i itself.

degree[i] = Number of vertices adjacent to vertex i.

Now the contribution of the ith vertex to the hash value is:

hash[i] = ( (P[level[i]]+degree[i]) * child[i] ) % modVal

And the hash value of the entire tree is the sum of the hash values of all vertices:

(hash[1]+hash[2]+....+hash[n]) % modVal
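The formula above could be computed with one traversal, roughly as in the sketch below. The names `tree_hash`, `B` and `MOD`, the constants chosen for them, the 0-based vertex numbering and the adjacency-list input are all assumptions made for illustration; the answer does not fix any of them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed constants: the answer only asks for a prime base and a large prime modulus.
constexpr std::uint64_t B = 31;
constexpr std::uint64_t MOD = 1000000007ULL;

// adj is the adjacency list of an undirected tree with vertices 0..n-1.
std::uint64_t tree_hash(const std::vector<std::vector<int>>& adj, int root) {
    const int n = static_cast<int>(adj.size());
    if (n == 0) return 0;
    assert(root >= 0 && root < n);

    std::vector<int> parent(n, -1), level(n, 0), order;
    std::vector<std::uint64_t> child(n, 1);   // subtree sizes, each vertex counts itself
    order.reserve(n);

    // Iterative DFS: every vertex appears in `order` after its parent.
    std::vector<int> stack{root};
    while (!stack.empty()) {
        const int v = stack.back();
        stack.pop_back();
        order.push_back(v);
        for (int u : adj[v]) {
            if (u == parent[v]) continue;
            parent[u] = v;
            level[u] = level[v] + 1;
            stack.push_back(u);
        }
    }
    // Walk the order backwards so children are added to their parents first.
    for (int i = n - 1; i > 0; --i) child[parent[order[i]]] += child[order[i]];

    // P[d] = B^d mod MOD
    std::vector<std::uint64_t> P(n + 1, 1);
    for (int d = 1; d <= n; ++d) P[d] = P[d - 1] * B % MOD;

    std::uint64_t total = 0;
    for (int v = 0; v < n; ++v) {
        const std::uint64_t degree = adj[v].size();
        const std::uint64_t h = (P[level[v]] + degree) % MOD * child[v] % MOD;
        total = (total + h) % MOD;
    }
    return total;
}
```

With these constants, the three-vertex path 0-1-2 hashes to 1034 when rooted at either end and to 73 when rooted at the middle vertex, so trees have to be rooted consistently before their hashes are compared.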

score 1 · Answer 2 · answered Oct 22 '19 at 11:28

If we use this definition of tree equivalence:

T1 is equivalent to T2 iff all paths to leaves of T1 exist exactly once in T2, and all paths to leaves of T2 exist exactly once in T1

Hashing a sequence (a path) is straightforward. If h_tree(T) is a hash of all paths-to-leaves of T, where the order of the paths does not alter the outcome, then it is a good hash for the whole of T, in the sense that equivalent trees will produce equal hashes, according to the above definition of equivalence. So I propose:

h_path(path) = an order-dependent hash of all elements in the path. 
            Requires O(|path|) time to calculate, 
            but child nodes can reuse the calculation of their 
            parent node's h_path in their own calculations.     
h_tree(T) = an order-independent hashing of all its paths-to-leaves. 
            Can be calculated in O(|L|), where L is the number of leaves

In pseudo-c++:

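#include <cstdint>
#include <vector>

// Example values only; the answer leaves PRIME1 and PRIME2 unspecified.
constexpr std::uint32_t PRIME1 = 31;
constexpr std::uint32_t PRIME2 = 37;

struct node {
    std::uint32_t path_hash = 0;  // path-to-root hash; only use for building tree_hash
    std::uint32_t tree_hash = 0;  // takes children into account; use to compare trees
    int content = 0;
    std::vector<node> children;
    // Unsigned arithmetic wraps on overflow; signed overflow would be undefined.
    std::uint32_t update_hash(std::uint32_t parent_path_hash = 1) {
        path_hash = parent_path_hash * PRIME1 + content;     // order-dependent
        tree_hash = path_hash;
        for (node& n : children) {  // by reference, so each child keeps its own hashes
            tree_hash += n.update_hash(path_hash) * PRIME2;  // order-independent
        }
        return tree_hash;
    }
};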

After building two trees, update their hashes and compare away. Equivalent trees should have the same hash, different trees not so much. Note that the path and tree hashes that I am using are rather simplistic, and chosen rather for ease of programming than for great collision resistance...
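For comparison, here is a minimal sketch of a more literal reading of the two definitions: only leaves contribute, and h_tree is the plain sum of the leaf path hashes. The `Node` type, the `PRIME` multiplier and the exact signatures are assumptions for illustration, and this variant is just as simplistic about collisions as the version above.

```cpp
#include <cstdint>
#include <vector>

struct Node {                       // hypothetical minimal node type
    int content = 0;
    std::vector<Node> children;
};

// Assumed multiplier (the 64-bit FNV prime); any odd constant would do for a sketch.
constexpr std::uint64_t PRIME = 1099511628211ULL;

// Extends the parent's path hash by one element, so it depends on the order along the path.
std::uint64_t h_path(std::uint64_t parent_path_hash, int content) {
    return parent_path_hash * PRIME + static_cast<std::uint64_t>(content);
}

// Sums the path hashes of all leaves. Addition commutes, so sibling order does not matter.
std::uint64_t h_tree(const Node& n, std::uint64_t parent_path_hash = 1) {
    const std::uint64_t here = h_path(parent_path_hash, n.content);
    if (n.children.empty()) return here;   // a leaf contributes its full path hash
    std::uint64_t sum = 0;
    for (const Node& c : n.children) sum += h_tree(c, here);
    return sum;
}
```

Each node is visited once and reuses its parent's path hash, so the work per leaf stays constant once the path hashes are known. Because the leaf hashes are only summed, repeated paths are counted rather than rejected, and unrelated sets of paths can still add up to the same value.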

Thomas W · Answer 3 · 2013-08-24T11:42:37.470

Child hashes should be successively multiplied by a prime number & added. Hash of the node itself should be multiplied by a different prime number & added.

Cache the hash of the tree overall -- I prefer to cache it outside the AST node, if I have a wrapper object holding the AST.

public class RequirementsExpr {
    protected RequirementsAST ast;
    protected int hash = -1;

    public int hashCode() {
        if (hash == -1)
            this.hash = ast.hashCode();
        return hash;
    }
}

public class RequirementsAST {
    protected int    nodeType;
    protected Object data;
    // -
    protected RequirementsAST down;
    protected RequirementsAST across;

    public int hashCode() {
        int nodeHash = nodeType;
        nodeHash = (nodeHash * 17) + (data != null ? data.hashCode() : 0);
        nodeHash *= 23;            // prime A.

        int childrenHash = 0;
        for (RequirementsAST child = down; child != null; child = child.across) {
            childrenHash *= 41;    // prime B.
            childrenHash += child.hashCode();
        }
        int result = nodeHash + childrenHash;
        return result;
    }
}

The result is that child/descendant nodes in different positions are always multiplied in by different factors, and the node itself is always multiplied in by a different factor from any possible child/descendant nodes.

Note that other primes should also be used in building the nodeHash of the node data itself. This helps avoid, e.g., different values of nodeType colliding with different values of data.

Within the limits of 32-bit hashing, this scheme overall gives a very high chance of uniqueness for any differences in tree structure (e.g. transposing two siblings) or value.

Once calculated (over the entire AST) the hashes are highly efficient.

score 0 · Answer 4 · edited May 23 '17 at 12:13

I would recommend converting the tree to a canonical sequence and hashing the sequence. (The details of the conversion depend on your definition of equivalence. For example, if the trees are binary search trees and the equivalence relation is structural, then the conversion could be to enumerate the tree in preorder, as the structure of binary search trees can be recovered from the preorder enumeration.)
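As one possible instance of this idea, the sketch below assumes rooted, unlabeled trees that count as equivalent when they differ only in sibling order. It builds a parenthesis string in the style of the AHU tree-isomorphism encoding (a technique swapped in here, not one the answer names) and hashes it with `std::hash`. The `Node` type and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Node {                       // hypothetical unlabeled tree node
    std::vector<Node> children;
};

// Canonical string of a rooted, unordered tree: encode every child,
// sort the encodings, and wrap the result in parentheses.
std::string canonical(const Node& n) {
    std::vector<std::string> parts;
    parts.reserve(n.children.size());
    for (const Node& c : n.children) parts.push_back(canonical(c));
    std::sort(parts.begin(), parts.end());  // sorting removes the dependence on sibling order
    std::string out = "(";
    for (const std::string& p : parts) out += p;
    out += ")";
    return out;
}

// Hash of the canonical sequence; compute once per tree and cache it.
std::size_t canonical_hash(const Node& n) {
    return std::hash<std::string>{}(canonical(n));
}
```

Two trees that differ only in sibling order yield the same string and therefore the same hash, while the string itself can be compared to rule out a collision. For unrooted trees, a common approach is to root each tree at its center before encoding; labeled nodes would need their label written into the string as well.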

Thomas's answer boils down at first glance to associating a multivariable polynomial with each tree and evaluating the polynomial at a particular location. There are two steps that, at the moment, have to be assumed on faith; the first is that the map doesn't send inequivalent trees to the same polynomial, and the second is that the evaluation scheme doesn't introduce too many collisions. I can't evaluate the first step presently, though there are reasonable definitions of equivalence that permit reconstruction from a two-variable polynomial. The second is not theoretically sound but could be made so via Schwartz-Zippel.
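One way the Schwartz-Zippel remark could be put into practice is to evaluate a tree polynomial at random points modulo a prime instead of at fixed small primes. The sketch below is an assumption-laden illustration, not Thomas's scheme: it uses a construction for rooted, unordered, unlabeled trees in which a leaf evaluates to 1 and a node of height h evaluates to the product of (x[h] + value of child) over its children. The `Node` type, `TreeHasher`, `MOD` and the seeding are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

struct Node {                       // hypothetical unlabeled tree node
    std::vector<Node> children;
};

constexpr std::uint64_t MOD = 1000000007ULL;  // assumed prime modulus

struct TreeHasher {
    std::vector<std::uint64_t> x;   // x[h]: random evaluation point for nodes of height h
    std::mt19937_64 rng;

    explicit TreeHasher(std::uint64_t seed) : rng(seed) {}

    // Draws evaluation points lazily, one per height.
    std::uint64_t point(std::size_t h) {
        while (x.size() <= h) x.push_back(rng() % MOD);
        return x[h];
    }

    // Returns {height of n, value of n's polynomial at the random points}.
    std::pair<std::size_t, std::uint64_t> eval(const Node& n) {
        std::size_t height = 0;
        std::vector<std::uint64_t> vals;
        for (const Node& c : n.children) {
            const auto r = eval(c);
            height = std::max(height, r.first + 1);
            vals.push_back(r.second);
        }
        std::uint64_t f = 1;                       // a leaf evaluates to 1
        const std::uint64_t xh = point(height);
        for (std::uint64_t v : vals) f = f * ((xh + v) % MOD) % MOD;  // product commutes
        return {height, f};
    }

    std::uint64_t hash(const Node& n) { return eval(n).second; }
};
```

All trees that are to be compared must go through the same `TreeHasher`, so that they share the same evaluation points. If inequivalent trees do get distinct polynomials under this construction, the Schwartz-Zippel lemma bounds the chance that a given pair collides by roughly the polynomial degree divided by `MOD`, and a larger modulus or a second independent evaluation would lower it further.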
