9

I have a set of bit patterns, and want to find the index of the element in the set which matches a given input. The bit patterns contain "don't care" bits, that is, x-es which match both 0 and 1.

Example: the set of bit patterns is

index abcd
   0  00x1
   1  01xx
   2  100x
   3  1010
   4  1x11

Then, trying to match 0110 should return index 1 and 1011 should return index 4.
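
For reference, the matching rule can be sketched as a plain linear scan. This is only a baseline illustration, assuming each pattern is stored as a pair of integers: a "care" mask with a 1 at every non-x position, and the fixed bits with every x set to 0. The names `parse`, `PATTERNS` and `linear_match` are made up for this sketch.

```python
def parse(pattern):
    """Turn a string such as '1x11' into a (care_mask, value) pair of ints."""
    care = int("".join("0" if c == "x" else "1" for c in pattern), 2)
    value = int(pattern.replace("x", "0"), 2)
    return care, value


PATTERNS = [parse(p) for p in ["00x1", "01xx", "100x", "1010", "1x11"]]


def linear_match(query):
    """Baseline O(N) scan: index of the first matching pattern, or None."""
    for index, (care, value) in enumerate(PATTERNS):
        # A pattern matches when the query agrees with it on every fixed bit.
        if (query & care) == value:
            return index
    return None
```

With 64-bit patterns each test is a single AND and compare, so the cost that needs to be reduced is the number of patterns visited, not the cost per pattern.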

How can this problem be solved faster than a linear search through the elements? I guess a kind of binary tree could be made, but then, what is a smart way of creating such a tree? Are there other efficient data structures/algorithms for such a problem, primarily in terms of query efficiency but also storage requirements?

  • The bit patterns will be 64 bits (or more)
  • The number of elements in the set will be in the order 10^5 - 10^7
  • Not all bit combinations are represented in the set, e.g. in the example 0000 is not represented
  • There will be a high number of x-es in the data set
  • A bit string will match only one of the elements in the set

I have two different cases in which I need to solve this problem

  • Case 1: I have the possibility of doing a lot of precomputing
  • Case 2: New elements will be added to the set on the fly

Update: The x-es are more likely to show up in some bit positions than others, that is, some bit positions will be dominated by x-es while others will be mainly zeroes/ones.

Petter T
  • Could your patterns intersect, or, if a bit string matches one, would it never match the others? – Gangnus Feb 10 '14 at 09:53
  • A bit string would match only one element. – Petter T Feb 10 '14 at 09:56
  • Add that to the list of conditions then, please. – Gangnus Feb 10 '14 at 09:56
  • How would you want to use a binary tree here? Since your `x`-es allow both 0 or 1, the tree would have to split into _both_ branches then wherever an `x` occurs IMHO … – CBroe Feb 10 '14 at 10:10
  • Looks like a typical combinatorial explosion problem, for which there's no fast generic solution. I doubt it's doable in less than O(N) (where N is the number of elements). As a side note: http://en.wikipedia.org/wiki/Content-addressable_memory#Ternary_CAMs – Karoly Horvath Feb 10 '14 at 10:29
  • @Cbroe: Are you referring to when you are querying the tree? If so, I don't think that would be a problem, since the bit strings (inputs) you are trying to match don't contain any 'x'-es – Petter T Feb 10 '14 at 10:31
  • @PetterT: No, he was referring to storing. You have to put each mask into both subtrees. Since you have many x-es, you end up with each element stored at 2^many places, which is just the combinatorial explosion I mentioned. – Karoly Horvath Feb 10 '14 at 10:33
  • @Karoly Horvath: I agree that elements in general may be put into both subtrees. But e.g. for the example provided, one could create a tree of height 3 with each element only stored once. – Petter T Feb 10 '14 at 10:48
  • @PetterT: I don't understand. Let's call it luck? Or have you forgotten to tell some important property of those patterns? – Karoly Horvath Feb 10 '14 at 11:07
  • @Karoly Horvath: Binary tree for provided example: First level: Check bit 'a'. Index '0' and '1' goes to left sub tree, '2', '3' and '4' goes to right. Second level, left node: Check bit 'b' to select '0' or '1'. Second level, right node: check bit 'c'. '2' goes to left sub tree, '3' and '4' to right. Third level, only one node with more than one element. These two elements ('3' and '4') can be separated by checking bit 'd'. – Petter T Feb 10 '14 at 11:15
  • @Karoly Horvath: Maybe a bit more than luck :-) I have added one more condition to the problem (x-es are more likely to show up in some bit positions than others) – Petter T Feb 10 '14 at 11:27
  • then looks like you can build a tree, the levels will be ordered by the probability that an `x` is in that position, with the top on the columns which don't have an `x` at all. – Karoly Horvath Feb 10 '14 at 12:03
  • About how many patterns of `x`'s will you have? (I see 4 in your example: `__x_ , __xx , ___x, _x__`) – גלעד ברקן Feb 10 '14 at 17:20
  • @groovy: There will be x's in a majority of the elements – Petter T Feb 10 '14 at 19:46
  • @PetterT Thanks, but that's not my question. My question is how many x-patterns. For example, you could have all elements with only one x-pattern (say, x in positions 3,5,6,7). – גלעד ברקן Feb 10 '14 at 19:52
  • @groovy: I have not yet written the code to generate the data, so it is hard to give exact answers. The number of patterns will be small compared to the number of elements. My best guess would be a few hundred. – Petter T Feb 10 '14 at 21:06
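
The tree discussed in the comments above can be sketched as follows. This is a minimal illustration, not code from the thread: it assumes patterns are given as strings over 0/1/x, greedily splits on the position with the fewest x-es, and copies a pattern into both subtrees when it has an x at the split position. The names `build`, `lookup` and `TREE` are made up for the sketch.

```python
def build(patterns):
    """Build a decision tree from a list of (index, pattern_string) pairs.

    An inner node is a tuple (position, subtree_for_0, subtree_for_1);
    a leaf is a list of the remaining (index, pattern_string) candidates.
    """
    width = len(patterns[0][1]) if patterns else 0
    # Only positions where both a fixed 0 and a fixed 1 occur can separate
    # the remaining patterns.
    useful = [p for p in range(width)
              if {"0", "1"} <= {s[p] for _, s in patterns}]
    if len(patterns) <= 1 or not useful:
        return patterns
    # Greedy heuristic: split on the useful position with the fewest x-es.
    pos = min(useful, key=lambda p: sum(s[p] == "x" for _, s in patterns))
    zeros = [(i, s) for i, s in patterns if s[pos] != "1"]  # '0' and 'x'
    ones = [(i, s) for i, s in patterns if s[pos] != "0"]   # '1' and 'x'
    return (pos, build(zeros), build(ones))


def lookup(tree, query):
    """Return the index of the pattern matching the 0/1 string, or None."""
    while isinstance(tree, tuple):
        pos, zero, one = tree
        tree = one if query[pos] == "1" else zero
    # A leaf still has to be verified: bits not tested on the way down
    # may disagree with the candidate.
    for index, s in tree:
        if all(c in ("x", q) for c, q in zip(s, query)):
            return index
    return None


TREE = build(list(enumerate(["00x1", "01xx", "100x", "1010", "1x11"])))
```

For the example set this reproduces the height-3 tree described in the comments, with every element stored once. When the chosen positions contain many x-es, elements are duplicated into both subtrees and storage can grow quickly, which is the concern raised above.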

6 Answers

3

I think you can build a trie for the bit patterns, where the node that ends a pattern stores the original index of that pattern.

To find a match, just search the trie: at each node, follow the child for the query's current bit and also the 'x' child, if either exists. The result may contain multiple indexes for a given input.

Here is my solution:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Solution {

    public static class Trie<T> {
        // Wildcard character: matches both '0' and '1' in a query.
        private static final char WILD = 'x';
        private final Map<Character, Trie<T>> children;
        private boolean isNode;
        private T value;

        public Trie() {
            children = new HashMap<>();
            isNode = false;
            value = null;
        }

        public void insert(String key, T value) {
            Trie<T> current = this;
            for (int i = 0; i < key.length(); i++) {
                char c = key.charAt(i);
                if (current.children.containsKey(c)) {
                    current = current.children.get(c);
                } else {
                    Trie<T> next = new Trie<>();
                    current.children.put(c, next);
                    current = next;
                }
            }
            current.isNode = true;
            current.value = value;
        }

        public List<T> get(String key) {
            List<T> result = new ArrayList<T>();
            get(this, key.toCharArray(), 0, result);
            return result;
        }

        private void get(Trie<T> trie, char[] chars, int index, List<T> result) {
            if (index == chars.length) {
                if (trie != null && trie.isNode) {
                    result.add(trie.value);
                }
                return;
            }
            char c = chars[index];
            if (trie.children.containsKey(c)) {
                get(trie.children.get(c), chars, index + 1, result);
            }
            if (trie.children.containsKey(WILD)) {
                get(trie.children.get(WILD), chars, index + 1, result);
            }
        }
    }

    public static void main(String[] args) {
        Trie<Integer> trie = new Trie<Integer>();
        trie.insert("00x1", 0);
        trie.insert("01xx", 1);
        trie.insert("100x", 2);
        trie.insert("1010", 3);
        trie.insert("1x11", 4);
        System.out.println(trie.get("0110")); // [1]
        System.out.println(trie.get("1011")); // [4]
    }
}
Qiang Jin
  • A pseudo-code implementation would be more useful for those unfamiliar with Java; this is a language-agnostic question. – chepner Feb 10 '14 at 14:51
2

You can build an automaton that matches a string in time linear in the length of the string here. For instance, you could store the set of strings---or, indeed, a function on the strings---in a (reduced, ordered) binary decision diagram. I suspect a BDD for any set of strings-with-don't-cares will have size linear in the total number of symbols, but I don't have a proof.

A BDD solution will be similar to, but slightly different from, Qiang Jin's excellent solution, where construction definitely takes linear space but queries aren't obviously (to me) fast in the worst case.

tmyklebu
1

I think the right container for the patterns is a specially ordered tree.

  • Each node of the tree records:
    • which bit position it refers to (node.position)
    • which bit is at that position (0, 1, x) (node.value)
  • Only leaf nodes can have x as their value.
  • The position of a child should always be greater than that of its parent, to exclude duplicate branches.
  • If a node has several children, they are ordered as follows:
    • first by position
    • of two children with the same position, the one with value 0 comes first.
  • The root node of such a tree is empty.
  • The tree is read as follows:
    • starting at the root, follow a path to a leaf, taking the 1s and 0s and putting them at the appropriate positions.
    • When we arrive at an x, fill all free positions with x-es.
    • If we do not arrive at an x, the leaf has a 1/0 value and the pattern is complete. If it is not complete, an error has occurred.

The matching in that tree should be done not by leaves, but by levels. A level is the set of children of one parent.

Take first level of the children as current level
Take the first child on the level for current
  read currentnode.position
  check the appropriate position in the matched string against child value. 
  If it fits, go higher up the tree.
  If it doesn't fit, go to next child.
  If we are out of children on the level, go down the tree.

The complexity of both adding a pattern and matching a binary string is O(log n) here. If a% of the bits are x-es, the time will be shorter by approximately a%, unlike in @Qiang Jin's solution. Also, searching in multi-branched trees is faster than in merely three-branched ones.

I would implement that tree as a hierarchy of lists.

Gangnus
0

If the overall number of x-patterns is relatively small, you could keep a list of them, and use a hashtable for the elements where all x's in the keys are set to ones (or zeros, it doesn't matter).

Then look up the query and all its modified forms, that is, where some of the query's bits are changed according to the x-pattern. (As the example suggests, perhaps checking which x-patterns would modify the query could be made efficient.)

To take your own example:

index  abcd    hash-key  x-patterns
0      00x1 => 0011      0010
1      01xx => 0111      0011
2      100x => 1001      0001
3      1010 => 1010      N/A
4      1x11 => 1111      0100

To match 0110, the first x-pattern does not modify 0110; 0111 matches index 1.
To match 1011, the first 3 x-patterns do not modify 1011; 1111 matches index 4. 

JavaScript code:

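// Hash key = pattern with every x set to 1. Each entry also records the
// x-pattern it was stored under, so that a hit can be verified.
var hash = {3: {index: 0, x: 2}, 7: {index: 1, x: 3}, 9: {index: 2, x: 1},
            10: {index: 3, x: 0}, 15: {index: 4, x: 4}}
  , x_patterns = [0, 2, 3, 1, 4]   // 0 covers patterns without any x

function lookup(query){
    for (var i = 0; i < x_patterns.length; i++){
        var entry = hash[query | x_patterns[i]]
        // Accept the hit only if the entry belongs to this x-pattern;
        // otherwise a query such as 0000 would wrongly match 00x1.
        if (entry !== undefined && entry.x === x_patterns[i])
            return entry.index
    }
    return -1   // no pattern matches (index 0 is a valid result)
}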

console.log(lookup(11), lookup(6))

Output:

4 1
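
The same idea can be sketched for wider patterns by keeping one hash table per distinct x-pattern. This is an illustration under assumptions, not the answer's code: patterns are given as strings, x-es are set to zero rather than one (the answer notes the choice does not matter), and the names `build_tables` and `lookup` are made up here.

```python
from collections import defaultdict


def build_tables(patterns):
    """Map each distinct care mask to a dict {fixed bits: pattern index}."""
    tables = defaultdict(dict)
    for index, p in enumerate(patterns):
        # care has a 1 wherever the pattern has a fixed bit (not an x).
        care = int("".join("0" if c == "x" else "1" for c in p), 2)
        value = int(p.replace("x", "0"), 2)
        tables[care][value] = index
    return dict(tables)


def lookup(tables, query):
    """One hash probe per distinct x-pattern; None when nothing matches."""
    for care, table in tables.items():
        index = table.get(query & care)
        if index is not None:
            return index
    return None
```

A query costs one probe per distinct x-pattern regardless of the number of elements, and adding an element on the fly is a single dictionary insert, so the approach pays off when the number of x-patterns is small compared to the number of elements.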
גלעד ברקן
0

I can think of one fast and simple solution.

First of all, if the number of "don't care" bits is small, you can simply expand the index and use a regular hash table (Python dict, C++ std::unordered_map, etc.). In this case, this:

index abcd
   0  00x1

becomes this:

index abcd
   0  0001
   1  0011

A lookup in the expanded index is then a single hash-table probe.
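
A minimal sketch of this expansion, assuming patterns are given as strings and that every expanded key should map back to the index of the pattern it came from (the names `expand` and `build_index` are made up for the illustration):

```python
from itertools import product


def expand(pattern):
    """Yield every concrete bit string matched by a pattern such as '00x1'."""
    choices = ["01" if c == "x" else c for c in pattern]
    for bits in product(*choices):
        yield "".join(bits)


def build_index(patterns):
    """Map each expanded bit string to the index of its source pattern."""
    return {s: i for i, p in enumerate(patterns) for s in expand(p)}
```

A pattern with k x-es expands to 2^k keys, so this is only practical while k stays small.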

Hope it helps!

Sergio Ayestarán
0

I just faced the same problem. In my case, I'm working on an emulator, and performance is a big concern for me. However, the idea of using the patterns exactly as they appear in the processor's architecture documentation is very appealing. If the language you are using has any metaprogramming features, you should consider using them. I'm using Zig, which has such a construct. In my case, since the patterns come from documentation, all the information I need is available at compile time. Below is my code. `comptime` means the expression is known at compile time, so the whole block is evaluated by the compiler. The only expressions that become actual program code are those involving `value`, i.e. the last three lines.

const std = @import("std");
const info = std.log.info;
const assert = std.debug.assert;

pub fn main() void {
    const t = 0b110001;
    const match = matchBitPattern(u6, "x10x01", t);
    info("is a match {any}", .{match});
}

fn matchBitPattern(comptime T: type, comptime pattern: []const u8, value: T) bool {
    const p: [2]T = comptime blk: {
        const l = @bitSizeOf(T);
        assert(pattern.len == l);
        var set: T = 0;
        var reset: T = 0;
        for (pattern) |c, i| {
            _ = switch (c) {
                '1' => {
                    set = (1 <<  (l - i - 1)) | set;
                },
                '0' => {
                    reset = (1 << (l - i - 1)) | reset;
                },
                'x' => void,
                else => @compileError("pattern should consist of 1, 0 or x")
            };
        }

        break :blk .{ set, reset };
    };
    const set = comptime p[0];
    const reset = comptime p[1];

    const is_set = (set & value) == set;
    const is_reset = (reset & ~value) == reset;

    return is_set and is_reset;
}
artronics