0

I'm working on creating database, there are multiple tables such as Apps, Space, Builds, Processes.

The app which I'm going to use this database for is like a real-time dashboard for platform monitoring, and the database will update every second with new information.

Rows will be repeated with apps_guid whenever new information is received and that is the reason I can not use apps_guid as a primary key.

I'm planning to use Murmur3 hash as a primary key in these tables and use to check duplicates.

I know Cassandra uses it as a primary key but not sure how it will preform in relational databases, is it a good idea? do you have any better recommendations?

Apps Column:

--
id int IDENTITY
app_guid binary
name varchar(255)
state varchar(255)
created_at datetime
updated_at datetime
type varchar(255)
stack varchar(255)
apphash binary PK
timestamp datetime
Jojo
  • 127
  • 1
  • 2
  • 12
  • If it is always unique then it might make a good primary key. But it will make a very poor clustered index. You would want to create a clustered index on another column that would make sense. – Sean Lange Aug 16 '19 at 13:48
  • It will be unique, I mean if data changes there will be new hash. I'm not really familiar with creating clustered index, thank you for pointing to direction though. – Jojo Aug 16 '19 at 13:55
  • By default your primary key will be the clustered index. And things like a hash will make a poor clustered index because the values are all over the place. The fragmentation will hit 99% with as few as 2-3 thousand rows. – Sean Lange Aug 16 '19 at 14:52
  • So should I make `id` as my primary key? because that's the only option I have. As I mentioned I can not use `app_guid` as a primary key. – Jojo Aug 16 '19 at 14:55
  • I can't tell you if that is the right choice but it sounds reasonable. What are you going to use for your clustered index? – Sean Lange Aug 16 '19 at 14:59
  • @SeanLange I really dont know because except `id` and `apphash` everything else is not unique – Jojo Aug 16 '19 at 15:02
  • timestamp should be unique per app right? It sounds reasonable to have (app_guid,timestamp) as PK, which can speed up queries for the latest app states – PeterHe Aug 16 '19 at 15:18
  • 1
    How can you say for sure that the hash code will be unique? It seems to me like you're making an invalid assumption at the start. What data do you pass to the hashing function? – Jim Mischel Aug 16 '19 at 15:23
  • @JimMischel well I'm sure it will be unique because the data involves date and time, and passing `app_guid binary, name varchar(255), state varchar(255), created_at datetime, updated_at datetime` – Jojo Aug 16 '19 at 15:29
  • @PeterHe oh good point – Jojo Aug 16 '19 at 15:30
  • not sure why the id column is added, it seems useless. – PeterHe Aug 16 '19 at 15:32
  • @PeterHe sure I will delete that. – Jojo Aug 16 '19 at 15:33
  • 1
    That doesn't guarantee that the Murmur3 hash will be unique. The number of possible combinations of your data far exceeds the number of Murmur3 hash codes. By the [Pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle), you *will* get collisions at some point. With a 32-bit hash code, the likelihood of getting a collision approaches 50% after only about 72,000 items. With only 200,000 items, a collision is almost guaranteed. Using a hash code for a primary key is an exceptionally bad idea. – Jim Mischel Aug 16 '19 at 17:38
  • @JimMischel Sure makes sense, but then what you recommend? – Jojo Aug 16 '19 at 17:40
  • I typically use an auto-generated sequential value for the primary key. You can use the hash code as a secondary key if you want, but you have to take into account the possibility of multiple records hashing to the same key. – Jim Mischel Aug 16 '19 at 17:46
  • @JimMischel that was my initial thought thank you for confirming – Jojo Aug 16 '19 at 17:50

1 Answers1

1

It's never a good idea to use a hash function for a primary key.

Cassandra doesn't use it this way, it can use murmur to define the partition/shard where the record goes, which is totally different. Hash functions have collisions. It's OK for them because they are not supposed to be used to guarantee uniqueness. You get a fixed output for a variable size input. It's quite self-explanatory why collisions are fine for this class of functions.

There are different algorithms that can generate a unique ID even for a very high volume of ingestion/input. You can either create a separate GUID(different from your apps_guid) or look into more sophisticated algorithms like:

But there are others...

Also check Instagram's approach; nice article from Segment and this paper.

yuranos
  • 8,799
  • 9
  • 56
  • 65