We have a use case for bit-level querying in Hadoop. It goes something like this:
Given a set of variable-length records, each containing a date/time stamp and one or more sixteen-bit data words, return a list of the date/time stamps where some combination of arbitrary bits from one or more arbitrary data words is set to the values specified in the query.
Example... Given the following data:
Timestamp Word 1 bits Word 2 bits
------------------ ---------------------- ---------------------
2017-06-16 08:15:05 0010 1101 1111 0000 1011 0010 1111 0010
2017-06-16 08:15:06 0010 1110 1111 0000 ...
2017-06-16 08:15:07 0010 1101 1111 0000 ...
2017-06-16 08:15:08 0010 1110 1111 0000
2017-06-16 08:15:09 0010 1101 1111 0000
2017-06-16 08:15:10 0010 1110 1111 0000
If the query is "Return all of the time stamps where word 1 bit zero is 0 and word 1 bit one is 1", the result would be
Timestamp Word 1 bits
------------------ ----------------------
2017-06-16 08:15:06 0010 1110 1111 0000
2017-06-16 08:15:08 0010 1110 1111 0000
2017-06-16 08:15:10 0010 1110 1111 0000
                           ^^
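In bitwise terms, and assuming conventional LSB-0 numbering (which would make the two flagged bits bits 9 and 8 of word 1), the example condition reduces to a mask-and-compare on the raw word. A minimal Java sketch (class name is just for illustration):

public class MaskExample {
    public static void main(String[] args) {
        int mask = 0x0300;      // selects bits 9 and 8 of word 1 (LSB-0 numbering assumed)
        int expected = 0x0200;  // bit 9 must be 1, bit 8 must be 0
        for (String hex : new String[] {"2DF0", "2EF0"}) {
            int word1 = Integer.parseInt(hex, 16) & 0xFFFF;
            System.out.println(hex + " matches: " + ((word1 & mask) == expected));
        }
    }
}

This prints false for 2DF0 and true for 2EF0, matching the result table above.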
The data is available in tab-delimited form as hex values:
Timestamp Word1 Word2 Word3 Word4
------------------ ---- ---- ---- ----
2017-06-16 08:15:05 2DF0 ... a varying number of 16-bit data words continues out here.
2017-06-16 08:15:06 2EF0
2017-06-16 08:15:07 2DF0
2017-06-16 08:15:08 2EF0
2017-06-16 08:15:09 2DF0
2017-06-16 08:15:10 2EF0
...
We've been kicking around how we might represent this data in Hive and query it. Putting each bit of each data word into its own integer field seems wildly inefficient, but it has the virtue of being directly queryable, assuming Hive can accommodate a variable number of columns per record.
To solve this problem, I floated the suggestion that we import the data into Hive as first-class timestamps and 16-bit unsigned integers, and then, for each query, run a MapReduce job that uses bit-extraction Java functions to build a temporary table with a timestamp field and each bit of interest in its own first-class integer column. The Hive query required to get the final result from that temporary table would be, shall we say, trivial.
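For illustration, here is a minimal sketch of the kind of bit-extraction helper those Java functions might amount to (class and method names are placeholders, and LSB-0 bit numbering is an assumption on my part):

public final class BitExtract {

    private BitExtract() {}

    /** Parse one hex data word, e.g. "2EF0", into an unsigned 16-bit value. */
    public static int parseWord(String hex) {
        return Integer.parseInt(hex.trim(), 16) & 0xFFFF;
    }

    /** Return the value (0 or 1) of the given bit (LSB = bit 0) of a word. */
    public static int bit(int word, int bitIndex) {
        return (word >>> bitIndex) & 1;
    }
}

The MapReduce job would emit the timestamp plus bit(word, n) for each bit named in the query, and the Hive query over the resulting temporary table would then be a plain equality filter.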
However, the idea currently being proposed is to save the hexadecimal text directly into the data lake. Our data scientist seems to think that such an arrangement would allow direct querying (that is, no temporary table would be required) and that the hex format offers reasonably efficient storage.
How might this work? Is there some way to index such text and then do some sort of bit-level text search on it, masking off the bits that are not of interest?
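The only way I can picture the "query the hex directly" idea working is something like a small custom Hive UDF that converts each hex word on the fly, so that a bit test becomes an ordinary WHERE clause. A hedged sketch, assuming the old reflection-style UDF base class is acceptable (the function, table, and column names below are hypothetical):

// Hypothetical usage once registered:
//   SELECT ts FROM records WHERE word_bit(word1, 9) = 1 AND word_bit(word1, 8) = 0;
import org.apache.hadoop.hive.ql.exec.UDF;

public class WordBit extends UDF {
    /** Returns the requested bit (LSB = bit 0) of a hex-encoded 16-bit word, or null on bad input. */
    public Integer evaluate(String hexWord, int bitIndex) {
        if (hexWord == null || bitIndex < 0 || bitIndex > 15) {
            return null;
        }
        try {
            int word = Integer.parseInt(hexWord.trim(), 16) & 0xFFFF;
            return (word >>> bitIndex) & 1;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}

If I understand the built-ins correctly, conv(word1, 16, 10) cast to an integer and combined with the & operator might achieve much the same thing without a custom UDF; but either way that is a conversion plus a mask at query time, not any kind of index over the text.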
(I'll entertain suggestions on how this problem could potentially be solved in a better way.)