Hashing methodology for collection of strings and integer ranges

Question

I have a data, for example per the following:

I need to match the content with the input provided for the Content & Range fields to return the matching rows. As you can see the Content field is a collection of strings & the Range field is a range between two numbers. I am looking at hashing the data, to be used for matching with the hashed input. Was thinking about Iterating through the collection of individual strings hashcode & storing it for the Content field. For the Range field I was looking at using interval trees. But then the challenge is when i hash the Content input & Range input how will i find if it that hashcode is present in the hashcode generated for the collection of strings in the Content fields & the same for the Range fields.

Please do let me know if there are any other alternate ways in which this can be achieved. Thanks.

Can you please add an example of the input query and output result? — displayName, Apr 04 '17 at 18:02
The input will be ContentField = "Gfd" , RangeValue = 2. The output for this will be RowID 1 & 2. — Sanal, Apr 04 '17 at 18:27
Sorry there was a typo, i was correcting the edit but it wasn't allowing me to edit saying the limit is only 5 minutes. This is what I intended to say -The input will be ContentField = "Gfd" , RangeValue = 2. The output for this will be RowID 2 as only this row satisfies both the input condition. If the input is ContentField = "Gfd" , RangeValue = 1 then the output will be none as there is no row which satisfies both the condition. If ContentField = "Uew" , RangeValue = 1 then the output will be the RowID 3. — Sanal, Apr 04 '17 at 18:45
The data is in database, stored in json format. Its retrieved and stored in memory in the application. The execution is handled in the middle tier. — Sanal, Apr 04 '17 at 19:50

score 1 · Accepted Answer · answered Apr 04 '17 at 20:09

1

There is a simple solution to your problem: Inverted Index.

For each item in content, create the inverted index that maps 'Content' to 'RowID', i.e. create another table of 2 columns viz. Content(string), RowIDs(comma separated strings).

For your first row, add the entries {Azd, 1}, {Zax, 1}, {Gfd, 1}..., {Mni, 1} in that table. For the second row, add entries for new Content strings. For the Content string already present in the first row ('Gfd', for example), just append the new row id to the entry you created for first row. So, Gfd's row will look like {Gfd, 1,2}.

When done processing, you will have the table that will have 'Content' strings mapped to all the rows in which this content string is present.

Do the same inverted indexing for mapping 'Range' to 'RowID' and create another table of Range(int), RowIDs(comma seperated strings).

Now, you will have a table whose rows will tell which range is present in which row ids.

Finally, for each query that you have to process, get the corresponding Content and Range row from the inverted index tables and do an intersection of those comma seperated list. You will get your answer.

answered Apr 04 '17 at 20:09

displayName

13,888
8
60
75

That looks good. But for the Range field if the range is a bit bigger in size say 1-100 then won't the table get big soon, considering other rows too. – Sanal Apr 04 '17 at 20:16
1

@Sanal: it will get as big as your input data. Create a row for the range only when that range number is added to the (rowid, content, range) table. So, when for some row a range 10-12 is added, you will either add three new rows for 10, 11 and 12 OR you will only append the row ids if these range values preexist in your inverted index table. – displayName Apr 04 '17 at 20:19
I am doing the processing in the middle tier. Have stored the inverted index in a Dictionary>> collection in C#. Guess you were pointing to storing the inverted index in DB. Since there are multiple inputs it becomes paramount to retrieve the result fast. I read here - https://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html#fig:postings-merge-algorithm that linear merge will be the faster approach for retrieval from Inverted Indexes. Any pointers here will be great. – Sanal Apr 10 '17 at 16:04
@Sanal: What's the issue exactly? The storage of the two inverted indices or their merge. – displayName Apr 10 '17 at 16:15
I have stored it in a dictionary in middle tier. I want to retrieve it in linear time when provided with the inputs. Is there any data structure other than this which you advocate for storing the inverted indices, such that it aids in returning the values in linear time. – Sanal Apr 10 '17 at 16:30
@Sanal: Put the rowIds for Content and Range values in a HashSet. When you get a query for some Content and Range value, you will have two HashSWet - one for the Content, other for Range. Whichever is smaller in size, read its values one by one and search for that value in other set. This search will be an O(1) operation and since you do it for each item in the smaller set, it will be overall an O(n) operation if the number of ints (i.e. RowIds) in the smaller HashSet is n. – displayName Apr 10 '17 at 16:53
But Hashset only allows for the RowIds to be stored, while the search is to be done using the values of the Content & Range fields. How do I search for the rowId's directly without using the input values. – Sanal Apr 10 '17 at 17:22
1

@Sanal: The inverted index, for e.g. for Content -> RowIds, would look like this: private HashMap> _contentInvertedIndex = new... So, it will be a HashMap of strings, with each key pointing to a HashSet. When you get a query of some string "xyz", you get the HashSet h1 = _contentInvertedIndex["xyz"]. Similarly from other inverted index (HashMap>) you get the other HashSet h2 and then you call h1.Intersect(h2) to get the common RowIds. Does that make sense? – displayName Apr 10 '17 at 17:39
Thanks @displayName. Wanted to make sure, which data structure you were referring to for using the values. This helps. – Sanal Apr 11 '17 at 04:33
If we add another column Quantity which has values like for example "> 100" or "> 200" or ">300" then how will it be added to the inverted indices. I was thinking to have an extension method for the Hashset to find out the first value in the Quantity field which is the closest value lesser than the input provided. For example, if I provide the input as 101 then it will pick up the field in Quantity ">100". Is that the best way to do? – Sanal Apr 11 '17 at 13:51
@Sanal: Seems fine to me. In general, be bold and go with the design you *feel* is right. If it is right, great. If it turns out to be wrong, you learned a lesson and that's great too. :) – displayName Apr 11 '17 at 14:32
Sure. Just wanted to check, as this was a different approach from how the inverted indexes work. – Sanal Apr 11 '17 at 14:49

Hashing methodology for collection of strings and integer ranges

1 Answers1