Can Locality Sensitive Hashing be used on dynamic data? For example, assume I first run LSH on 1,000,000 documents and store the results in an index, and then I want to add another document to that index. Can I do it using LSH?
2 Answers
2
Yes.
LSH applies multiple hash functions to generate a signature for each document, and these signatures are then banded to produce bucket keys. If you store the random hash functions and the banding parameters, you can reuse them to compute the bucket keys for each new document. Thus every new insert gets a corresponding index entry, placed in the same buckets it would have landed in during the original build.
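A minimal sketch of this idea in Python (all names here are illustrative, not from any particular library): the hash functions and band layout are fixed once, so a document added later with the same shingles as an already-indexed one falls into the same buckets.

```python
import random

NUM_HASHES = 8      # MinHash signature length
BANDS = 4           # number of bands; NUM_HASHES // BANDS rows per band
PRIME = 4294967311  # prime larger than any 32-bit shingle hash

random.seed(42)
# Store the random hash functions once; reuse them for every later insert.
hash_funcs = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
              for _ in range(NUM_HASHES)]

buckets = {}  # (band_index, band_signature) -> list of document ids

def minhash_signature(shingles):
    """For each hash function, keep the minimum hash over all shingles."""
    return tuple(min((a * s + b) % PRIME for s in shingles)
                 for a, b in hash_funcs)

def insert(doc_id, shingles):
    """Band the signature and file the document under each band's bucket."""
    sig = minhash_signature(shingles)
    rows = NUM_HASHES // BANDS
    for band in range(BANDS):
        key = (band, sig[band * rows:(band + 1) * rows])
        buckets.setdefault(key, []).append(doc_id)

def candidates(shingles):
    """Documents sharing at least one band bucket are candidate matches."""
    sig = minhash_signature(shingles)
    rows = NUM_HASHES // BANDS
    found = set()
    for band in range(BANDS):
        key = (band, sig[band * rows:(band + 1) * rows])
        found.update(buckets.get(key, []))
    return found

# Build an initial index, then add a new document later using the SAME
# stored hash functions -- identical shingle sets land in the same buckets.
insert("doc1", {101, 205, 333, 47})
insert("doc2", {9, 12, 77})
insert("doc_new", {101, 205, 333, 47})   # inserted after the initial build
print("doc1" in candidates({101, 205, 333, 47}))  # True
```

Because the stored hash functions are deterministic, the incremental insert is exactly as cheap as hashing one document; nothing about the existing buckets needs to be rebuilt.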

Averman
1
Yes, you can do this. You would only have to calculate the Jaccard similarities for the added document vs. the rest and add those to your index.
CREATE TABLE Documents (
ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
MinHashes BINARY(512), -- serialized Min Hash results
Name NVARCHAR(255) UNIQUE NOT NULL,
Content VARBINARY(MAX)
)
CREATE TABLE SimilarDocumentIndex (
DocumentAID INT REFERENCES Documents(ID),
DocumentBID INT REFERENCES Documents(ID),
Similarity FLOAT, -- Jaccard Similarity 0.0...1.0
PRIMARY KEY CLUSTERED (DocumentAID, DocumentBID)
)
--
-- Find similar documents
--
SELECT TOP 20 DocumentBID
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentID
ORDER BY Similarity DESC
--
-- Compare two documents
--
SELECT Similarity
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentAID AND DocumentBID = @DocumentBID
--
-- Adding a new document
--
SET @MinHashes = dbo.CalcMinHashes(@content)
INSERT INTO Documents (MinHashes, Name, Content)
VALUES (@MinHashes, @name, @content)
SET @DocumentID = SCOPE_IDENTITY()
INSERT INTO SimilarDocumentIndex
SELECT @DocumentID, ID, dbo.JaccardSimilarity(@MinHashes, MinHashes)
FROM Documents
WHERE ID <> @DocumentID
INSERT INTO SimilarDocumentIndex
SELECT DocumentBID, @DocumentID, Similarity
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentID
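The `dbo.CalcMinHashes` and `dbo.JaccardSimilarity` functions above are user-defined and not shown. A common way to implement the similarity side is to estimate Jaccard similarity as the fraction of MinHash positions where two signatures agree; a hypothetical Python sketch:

```python
def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching MinHash values approximates Jaccard similarity.

    Both arguments are MinHash signatures of equal length (e.g. the
    deserialized contents of the MinHashes column above).
    """
    assert len(sig_a) == len(sig_b)
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

print(jaccard_estimate((4, 8, 15, 16), (4, 8, 23, 16)))  # 0.75
```

The estimate converges to the true Jaccard similarity as the signature length grows, which is why the schema reserves a fairly wide `BINARY(512)` column for the serialized hashes.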

Louis Ricci
-
Sorry, but by index I did not mean database indexing, rather a hashtable data structure. I want to know if I can again use the LSH algorithm on the new document to allocate that document to the same bucket as done on 1000 documents. – Janitha Tennakoon Sep 01 '15 at 18:49
-
@janitha000 - As long as you don't have a maximum bucket size I think it would work without an issue. So it really depends on the implementation you are using. – Louis Ricci Sep 01 '15 at 19:08
-
Calculating the Jaccard Similarities for the added document vs the rest will not be an option because now I have several millions of documents. – Janitha Tennakoon Sep 06 '15 at 03:45
-
@janitha000 - calculating the similarities for a million documents isn't really that bad, I would try benchmarking an implementation before assuming it wouldn't be feasible. Maybe you could leave the similarity blank and use an asynchronous job to update it every ~hour, or minute or whenever. – Louis Ricci Sep 06 '15 at 15:40