
Can Locality Sensitive Hashing be used on dynamic data? For example, assume I first use LSH on 1,000,000 documents and store the results in an index, and then I want to add another document to that index. Can I do this using LSH?

Janitha Tennakoon

2 Answers


Yes.

LSH uses multiple hash functions to generate a signature for each document, and these signatures are then banded to produce bucket indexes. If you store the random hash functions and the banding scheme, you can reuse them to compute the bucket index for any new insertion. Thus, every new insert gets a corresponding index.
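A minimal Python sketch of that idea (the hash functions, band layout, and function names here are illustrative, not from the answer): because the seeded hash functions and band parameters are fixed and stored, a document added later hashes into the same bucket space as the original million.

```python
import hashlib

NUM_HASHES = 12  # signature length (illustrative)
BANDS = 4        # 4 bands of 3 rows each
ROWS = NUM_HASHES // BANDS

def minhash_signature(shingles, seeds=range(NUM_HASHES)):
    """One min-hash per seeded hash function. Reusing the same
    seeds later reproduces the exact same signature scheme."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in seeds
    ]

def band_keys(sig):
    """Split the signature into bands; each key is one LSH bucket."""
    return [tuple(sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]

# Build the index over the existing corpus ...
index = {}
for key in band_keys(minhash_signature({"a", "b", "c"})):
    index.setdefault(key, []).append("doc1")

# ... then a later insert reuses the same hash functions and bands:
for key in band_keys(minhash_signature({"a", "b", "c", "d"})):
    index.setdefault(key, []).append("doc2")
```

Since nothing about the hash functions or banding depends on corpus size, the insert path is identical to the initial build.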

Averman

Yes, you can do this. You would only have to compute the Jaccard similarity of the added document against the rest and add the results to your index.

CREATE TABLE Documents (
  ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL, 
  MinHashes BINARY(512), -- serialized Min Hash results
  Name NVARCHAR(255) UNIQUE NOT NULL, 
  Content VARBINARY(MAX)
)

CREATE TABLE SimilarDocumentIndex (
  DocumentAID INT REFERENCES Documents(ID),
  DocumentBID INT REFERENCES Documents(ID),
  Similarity FLOAT, -- Jaccard Similarity 0.0...1.0
  PRIMARY KEY CLUSTERED (DocumentAID, DocumentBID)
)

--
-- Find similar documents
--
SELECT TOP 20 DocumentBID
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentID
ORDER BY Similarity DESC

--
-- Compare two documents
--    
SELECT Similarity 
FROM SimilarDocumentIndex 
WHERE DocumentAID = @DocumentAID AND DocumentBID = @DocumentBID

--
-- Adding a new document
--
SET @MinHashes = dbo.CalcMinHashes(@content)

INSERT INTO Documents (MinHashes, Name, Content)
VALUES (@MinHashes, @name, @content)

SET @DocumentID = SCOPE_IDENTITY()

INSERT INTO SimilarDocumentIndex
  SELECT @DocumentID, ID, dbo.JaccardSimilarity(@MinHashes, MinHashes)
  FROM Documents 
  WHERE ID <> @DocumentID 

INSERT INTO SimilarDocumentIndex
  SELECT DocumentBID, @DocumentID, Similarity
  FROM SimilarDocumentIndex
  WHERE DocumentAID = @DocumentID
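The SQL above assumes two user-defined functions, dbo.CalcMinHashes and dbo.JaccardSimilarity, whose bodies the answer does not show. A hedged Python sketch of what they would typically compute (the signature length and hashing choices are assumptions, not from the answer):

```python
import hashlib

NUM_HASHES = 128  # the BINARY(512) column suggests 128 x 4-byte hashes

def calc_min_hashes(shingles):
    """Rough counterpart of dbo.CalcMinHashes: one 32-bit min-hash
    per seeded hash function over the document's shingle set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) % 2**32
            for s in shingles)
        for seed in range(NUM_HASHES)
    ]

def jaccard_similarity(mh_a, mh_b):
    """Rough counterpart of dbo.JaccardSimilarity: the fraction of
    matching min-hash slots estimates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(mh_a, mh_b)) / len(mh_a)
```

Identical shingle sets score 1.0, and largely disjoint sets score near 0.0, which is what the Similarity FLOAT column stores.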
Louis Ricci
  • Sorry, but by index I did not mean database indexing, rather a hashtable data structure. I want to know if I can again use the LSH algorithm on the new document to allocate that document to the same bucket as done on 1000 documents. – Janitha Tennakoon Sep 01 '15 at 18:49
  • @janitha000 - As long as you don't have a maximum bucket size I think it would work without an issue. So it really depends on the implementation you are using. – Louis Ricci Sep 01 '15 at 19:08
  • Calculating the Jaccard Similarities for the added document vs the rest will not be an option because now I have several millions of documents. – Janitha Tennakoon Sep 06 '15 at 03:45
  • @janitha000 - calculating the similarities for a million documents isn't really that bad, I would try benchmarking an implementation before assuming it wouldn't be feasible. Maybe you could leave the similarity blank and use an asynchronous job to update it every ~hour, or minute or whenever. – Louis Ricci Sep 06 '15 at 15:40