
We have a use case for bit-level querying in Hadoop. It goes something like this:

Given a set of variable-length records containing a date/time stamp and one or more sixteen-bit data words, return a list of date/time stamps where some combination of arbitrary bits from one or more arbitrary data words is set to the values specified in the query.

Example... Given the following data:

Timestamp             Word 1 bits                Word 2 bits
------------------    ----------------------     ---------------------          
2017-06-16 08:15:05   0010  1101  1111  0000     1011  0010  1111  0010
2017-06-16 08:15:06   0010  1110  1111  0000     ...
2017-06-16 08:15:07   0010  1101  1111  0000     ...
2017-06-16 08:15:08   0010  1110  1111  0000
2017-06-16 08:15:09   0010  1101  1111  0000
2017-06-16 08:15:10   0010  1110  1111  0000

If the query is "Return all of the time stamps where word 1 bit zero is 0 and word 1 bit one is 1", the result would be:

Timestamp             Word 1 bits
------------------    ----------------------
2017-06-16 08:15:06   0010  1110  1111  0000
2017-06-16 08:15:08   0010  1110  1111  0000
2017-06-16 08:15:10   0010  1110  1111  0000
                              ^^

The data is available in tab-delimited form as hex values:

Timestamp             Word1  Word2  Word3  Word4  
------------------    ----   ----   ----   ----
2017-06-16 08:15:05   2DF0  ... a varying number of 16 bit data words continues out here.
2017-06-16 08:15:06   2EF0
2017-06-16 08:15:07   2DF0
2017-06-16 08:15:08   2EF0
2017-06-16 08:15:09   2DF0
2017-06-16 08:15:10   2EF0
...

We've been kicking around how we might represent this data in Hive and query on it. Putting each bit of each data word into its own integer field seems wildly inefficient, but it has the virtue of being directly queryable, assuming that a variable number of columns in each record can be accommodated by the Hadoop server.

To solve this problem, I floated the suggestion that we import this data into Hive as first-class timestamps and 16-bit unsigned integers, and then construct a MapReduce job for each query, using bit-extraction Java functions to build a temporary table having a timestamp field and each bit of interest in its own first-class integer column. The Hive query required to get the final result from the temporary table would be, shall we say, trivial.
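For illustration, here is a minimal Python sketch of the bit-extraction step described above (the function name and bit-numbering convention are hypothetical; the real implementation would be Java inside a MapReduce job):

```python
def extract_bit(word, bit):
    """Return bit `bit` of a 16-bit word as 0 or 1, numbering from 0
    at the least-significant end (one plausible convention)."""
    return (word >> bit) & 1

# A row of the hypothetical temporary table: the timestamp plus each
# bit of interest pulled out into its own integer column.
row = ("2017-06-16 08:15:06", extract_bit(0x2EF0, 8), extract_bit(0x2EF0, 9))
```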

However, the idea currently being proposed is to save the hexadecimal text directly into the data lake. Our data scientist seems to think that such an arrangement would allow direct querying (that is, no temporary table would be required), and that the hex format offers reasonably efficient storage.

How might this work? Is there some way to index such text and then do some sort of bit-level text search on it, masking off the bits that are not of interest?
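For what it's worth, the mask-and-compare idea takes only a few lines once a word is treated as an integer. A hypothetical Python sketch, with the mask/value pair chosen to match the example query under one plausible bit numbering (bits 8 and 9, counting from the least-significant bit):

```python
MASK = 0x0300  # only bits 8 and 9 are of interest
WANT = 0x0200  # bit 9 must be 1, bit 8 must be 0

def matches(word):
    # Zero out the bits we don't care about, then compare what's left.
    return (word & MASK) == WANT
```

With the sample data, `matches(0x2EF0)` is True and `matches(0x2DF0)` is False, which selects the same three timestamps as the example query.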

(I'll entertain suggestions on how this problem could potentially be solved in a better way.)

Robert Harvey
  • "Why don't you just ask your data scientist?" That's a good question. It's a long story; the short answer is that I don't have ready access to the data scientist, and I'm not a Hadoop expert. – Robert Harvey Jun 16 '17 at 17:14

1 Answer


Demo

data.tsv

2017-06-16 08:15:05 2DF0
2017-06-16 08:15:06 2EF0    0000
2017-06-16 08:15:07 2DF0    AAAA    BBBB    CCCC
2017-06-16 08:15:08 2EF0    1111    2222
2017-06-16 08:15:09 2DF0    
2017-06-16 08:15:10 2EF0    DDDD    EEEE

create external table mytable
(
    ts          timestamp
   ,words       string
)
row format delimited
fields terminated by '\t'
stored as textfile
tblproperties ('serialization.last.column.takes.rest'='true')
;

select  *
       
from    mytable
;

+----------------------------+---------------------------+
|             ts             |            words          |
+----------------------------+---------------------------+
| 2017-06-16 08:15:05.000000 | 2DF0                      |
| 2017-06-16 08:15:06.000000 | 2EF0 0000                 |
| 2017-06-16 08:15:07.000000 | 2DF0 AAAA    BBBB    CCCC |
| 2017-06-16 08:15:08.000000 | 2EF0 1111    2222         |
| 2017-06-16 08:15:09.000000 | 2DF0                      |
| 2017-06-16 08:15:10.000000 | 2EF0 DDDD    EEEE         |
+----------------------------+---------------------------+

select  ts
       ,split(words,'\\t')  as words
       
from    mytable
;

+----------------------------+-------------------------------+
|             ts             |             words             |
+----------------------------+-------------------------------+
| 2017-06-16 08:15:05.000000 | ["2DF0"]                      |
| 2017-06-16 08:15:06.000000 | ["2EF0","0000"]               |
| 2017-06-16 08:15:07.000000 | ["2DF0","AAAA","BBBB","CCCC"] |
| 2017-06-16 08:15:08.000000 | ["2EF0","1111","2222"]        |
| 2017-06-16 08:15:09.000000 | ["2DF0",""]                   |
| 2017-06-16 08:15:10.000000 | ["2EF0","DDDD","EEEE"]        |
+----------------------------+-------------------------------+

select  ts
       ,lpad(conv(split(words,'\\t')[0],16,2),16,'0')  as word1_bits
       
from    mytable
;

+----------------------------+------------------+
|             ts             |    word1_bits    |
+----------------------------+------------------+
| 2017-06-16 08:15:05.000000 | 0010110111110000 |
| 2017-06-16 08:15:06.000000 | 0010111011110000 |
| 2017-06-16 08:15:07.000000 | 0010110111110000 |
| 2017-06-16 08:15:08.000000 | 0010111011110000 |
| 2017-06-16 08:15:09.000000 | 0010110111110000 |
| 2017-06-16 08:15:10.000000 | 0010111011110000 |
+----------------------------+------------------+
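The `lpad(conv(...))` idiom above can be mirrored in Python for sanity-checking (a sketch, not part of the Hive code):

```python
def to_bits(hex_word):
    # Equivalent of Hive's lpad(conv(hex_word, 16, 2), 16, '0'):
    # parse the hex text and render it as a zero-padded 16-character
    # binary string, most-significant bit first.
    return format(int(hex_word, 16), "016b")
```

For example, `to_bits("2DF0")` gives `"0010110111110000"`, matching the `word1_bits` column above.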

select  ts
       
from    mytable

where   substr(lpad(conv(split(words,'\\t')[0],16,2),16,'0'),7,2) = '10'
;

+----------------------------+
|             ts             |
+----------------------------+
| 2017-06-16 08:15:06.000000 |
| 2017-06-16 08:15:08.000000 |
| 2017-06-16 08:15:10.000000 |
+----------------------------+
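As a sanity check on the predicate: Hive's `substr` is 1-indexed, so `substr(bits, 7, 2)` reads the 7th and 8th characters counting from the most-significant end of the 16-character string, which correspond to integer bits 9 and 8. A quick Python sketch (hypothetical helper names) confirming that the string test and an equivalent integer-mask test agree on the sample words:

```python
def substr_test(hex_word):
    bits = format(int(hex_word, 16), "016b")
    return bits[6:8] == "10"  # Hive's substr(bits, 7, 2) = '10', 1-indexed

def mask_test(hex_word):
    # Characters 7-8 of the 16-char string are integer bits 9 and 8.
    return (int(hex_word, 16) & 0x0300) == 0x0200

for w in ["2DF0", "2EF0", "0000", "FFFF"]:
    assert substr_test(w) == mask_test(w)
```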

Alternative data structure

create external table mytable
(
    ts          timestamp
   ,word1       string
   ,word2       string
   ,word3       string
   ,word4       string
   ,word5       string
   ,word6       string
   ,word7       string
   ,word8       string
   ,word9       string
)
row format delimited
fields terminated by '\t'
stored as textfile
;

select * from mytable
;

+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
|             ts             | word1 | word2  | word3  | word4  | word5  | word6  | word7  | word8  | word9  |
+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
| 2017-06-16 08:15:05.000000 | 2DF0  | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:06.000000 | 2EF0  | 0000   | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:07.000000 | 2DF0  | AAAA   | BBBB   | CCCC   | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:08.000000 | 2EF0  | 1111   | 2222   | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:09.000000 | 2DF0  |        | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| 2017-06-16 08:15:10.000000 | 2EF0  | DDDD   | EEEE   | (null) | (null) | (null) | (null) | (null) | (null) |
+----------------------------+-------+--------+--------+--------+--------+--------+--------+--------+--------+
David דודו Markovitz