I am investigating SHA-1 hashing, with the hash stored in MS SQL Server as binary(20). Of the data types available in Solr 4.x, the only one that seems large enough is binary. However, I am unsure whether using binary as the uniqueKey is a good idea. Also, in the near future we will be moving from standalone Solr 4.x to Solr 6.x in cloud mode.

BillS

1 Answer

As a best practice, the unique key should be a short, unique string (see Java's UUID, for instance). Using binary as the unique key is neither a good idea nor recommended. A viable solution to your problem, though, can be found on this page of the Solr documentation:

Cryptographic hash

A cryptographic hashing algorithm can be thought of as creating N very random bits from the input data. The MD5 algorithm creates 128 bits. This means that two input data sets have a chance of 1 in 2^128 of creating the same MD5. There is a standard expression of this as 32 hexadecimal characters (RFC 1321). Several MD5 digest algorithm packages for various languages do not follow this standard.

The UUID standard always includes the time at the creation of the UUID, which precludes some of the above use cases. You can cheat and ignore the clock requirement. It is best to use the UUID text format: 550e8400-e29b-41d4-a716-446655440000 instead of 550e8400e29b41d4a716446655440000. (You will read many of these keys.)

One advantage in using a crypto-generated unique key is that you can select a random subset of documents via wildcards. If the UUID data is saved as a string in the 32-character RFC format, 'd3adbe3fdeadb3e4deadbee4deadb3ef', the query "id:a*" will select a random 1/16 of the entire document set. "id:aa*" selects 1/256 of the document set, again very randomly. Statistical analysis and data extraction projects can use this to select small subsets instead of walking the entire index.

The same approach will work well with any version of Solr.
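
As a rough illustration of the hex string key described in the quoted passage, here is a minimal Java sketch. It uses SHA-1 (the algorithm from the question, not the MD5 of the quote), so the key is 40 hex characters instead of 32. The class and method names are hypothetical, and hashing the UTF-8 bytes of the input is an assumption; the only real API used is the JDK's java.security.MessageDigest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Solr: builds a hex string key from a hash.
class HexKeySketch {

    // Returns the SHA-1 digest of the UTF-8 bytes of input
    // as 40 lowercase hexadecimal characters.
    static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
                hex.append(Character.forDigit(b & 0xF, 16));        // low nibble
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String key = sha1Hex("abc");
        System.out.println(key); // a9993e364706816aba3e25717850c26c9cd0d89d
        // A wildcard query such as id:a* matches only keys with this prefix.
        System.out.println(key.startsWith("a")); // true
    }
}
```

Because the same input always yields the same key, re-indexing an identical string overwrites the existing document rather than creating a duplicate.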

AR1
  • We are using both MS SQL Server and Solr 4.x (currently standalone, but moving to SolrCloud). I am using the SHA-1 hashing algorithm to avoid duplicates, so using Solr or SQL to generate a UUID is out, though I didn't know I could drop the clock component in Java, which may yield the same hash given the same input string (?). We are using Java for our web app. What I needed: 1. Java compares the Solr field and the MS SQL column. 2. Solr compares the Solr field and the MS SQL column on data import. Unfortunately, MS SQL HASHBYTES() returns varbinary(20). – BillS Dec 28 '16 at 16:40
  • What I ended up doing was using Apache Commons Codec's DigestUtils.shaHex("some string") to create a unique hex string key that is the primary key in SQL and the unique key in Solr. The more I thought about it, the less comfortable I felt using binary as a key in Solr. Thanks for the help. – BillS Dec 28 '16 at 16:40
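
For readers whose hash is still produced on the SQL Server side as varbinary(20), the comparison mentioned in the comments could look roughly like the Java sketch below. It converts between raw bytes and a hex string, so a value read over JDBC (for example with ResultSet.getBytes) can be checked against the Solr string key. The class and method names are hypothetical. It also assumes both sides hash the same byte encoding of the input: HASHBYTES over an nvarchar value hashes UTF-16 bytes, which will not match a hash of UTF-8 bytes computed in Java.

```java
import java.util.Arrays;

// Hypothetical helper for comparing a binary(20) value with a hex string key.
class BinaryKeyCompare {

    // Encodes raw bytes as lowercase hexadecimal, two characters per byte.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    }

    // Decodes a hex string (upper or lower case) back into raw bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string has odd length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + hex);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // True when the SQL binary value and the Solr hex key hold the same bytes.
    static boolean sameKey(byte[] sqlBinary, String solrHexKey) {
        return Arrays.equals(sqlBinary, fromHex(solrHexKey));
    }

    public static void main(String[] args) {
        byte[] fromSql = {(byte) 0xDE, (byte) 0xAD, 0x00, 0x0F}; // stand-in for a real hash
        System.out.println(toHex(fromSql));               // dead000f
        System.out.println(sameKey(fromSql, "DEAD000F")); // true
    }
}
```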