
I have gotten into a design problem and I thought I would ask you for advice.

I am currently indexing information from different services by polling their APIs, and from that data I am constructing a tailored model for use in my own service.

The problem I have run into is what my IDs should look like. The services provide an ID for each element in their collections (which is good), but on my end I don't think I want to use the external ID as the identifier on my documents. What if two services have duplicate IDs? How should I handle this? I am thinking of just adding a single character to the IDs, taken from the name of the polled service, but that is a problem because I want the IDs to be numeric. Or should I just create unique IDs of my own?

I am using ElasticSearch as datastore.

Thanks,

James Ford


1 Answer

2

I can think of three ways to handle this:

  1. Introduce a new key representing the source of the data to avoid collisions. Each document in Elasticsearch then carries an API ID (1, 2, 3, etc.) alongside the entity ID the service provided. All queries would use both the API ID and the entity ID.

  2. Add a large number to the IDs to space them out in a new global space. Just add something like 1 trillion to every ID and then they all get their own space for IDs. Obviously the trick here is to predict how much the data can grow. (You don't want collisions in the future.)

  3. Create your own auto-increment on new entities that get mapped to your tailored model.

Whichever one you pick I would recommend keeping the original ID in case you ever need to map it back to the source API.
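As a rough illustration of option 1 combined with keeping the original ID, here is a minimal Python sketch. The service names, the `SOURCE_IDS` registry, the `title` field, and the helper functions are all made up for the example and are not part of any Elasticsearch API; the point is only that the pair (source, external ID) is what identifies an entity.

```python
from dataclasses import dataclass, asdict

# Hypothetical registry: one small integer per polled service.
SOURCE_IDS = {"service_a": 1, "service_b": 2}


@dataclass(frozen=True)
class IndexedDoc:
    source_id: int    # which API the record came from (option 1)
    external_id: int  # the ID exactly as the source API reported it
    title: str        # placeholder for the tailored model's own fields


def build_doc(service: str, external_id: int, title: str) -> dict:
    """Build a document body that keeps both the source and the original ID."""
    return asdict(IndexedDoc(SOURCE_IDS[service], external_id, title))


def same_entity(a: dict, b: dict) -> bool:
    """Two documents describe the same entity only if both keys match."""
    return (a["source_id"], a["external_id"]) == (b["source_id"], b["external_id"])
```

With this layout, `build_doc("service_a", 42, "x")` and `build_doc("service_b", 42, "y")` share an external ID but are still distinct entities, and the original ID is always available for mapping back to the source API.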

ryan1234
  • Instead of adding a number to each ID, I would suggest multiplying by a power of 10 or 2. With a power of 10, the original ID can be seen at a glance. With a power of 2, the object ID and service ID are easier to separate programmatically by bit-shifting and bitwise AND. – Philipp Jan 21 '13 at 07:45
  • Thanks for your answers, but I don't see how suggestion number 2 would avoid collisions. Do you mean a different large number for each API? Suggestion number 1 would work, but then I would need some complex API ID? And I would not want to implement suggestion number 3 because I can't use serials. Before indexing a document I would also have to search for it, in order to know whether it is already indexed and whether I should increment or not. – James Ford Jan 21 '13 at 12:41
  • Yeah, consider what Philipp said as a follow-up. Imagine every API has auto-incremented IDs starting at 1. If you have APIs 1, 2, and 3, add 1 million to the IDs from the first API, 2 million to the IDs from the second, 3 million to the IDs from the third, and so on. This way each API gets its own block of 1 million IDs. Maybe that makes more sense? – ryan1234 Jan 21 '13 at 17:14
  • Yeah that makes more sense and it will work. I will need a space of a billion I guess. Thanks! – James Ford Jan 21 '13 at 20:24
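
The numeric schemes discussed in these comments could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `ID_SPACE` (one billion, the figure mentioned above), `SERVICE_BITS`, and all function names are invented for the example. The first pair of functions gives each API its own block of IDs; the second pair follows Philipp's power-of-2 idea, here with the service number stored in the low bits.

```python
from typing import Tuple

ID_SPACE = 1_000_000_000  # assumed upper bound on any one service's IDs


def to_global_id(api_number: int, external_id: int) -> int:
    """Place an external ID inside the block reserved for its API."""
    if api_number < 1:
        raise ValueError("api_number must be 1 or greater")
    if not 0 <= external_id < ID_SPACE:
        # Refuse IDs that would spill into the next API's block.
        raise ValueError("external_id does not fit in the reserved block")
    return api_number * ID_SPACE + external_id


def from_global_id(global_id: int) -> Tuple[int, int]:
    """Recover (api_number, external_id) from a global ID."""
    return divmod(global_id, ID_SPACE)


SERVICE_BITS = 8  # assumed: room for up to 255 services
SERVICE_MASK = (1 << SERVICE_BITS) - 1


def pack(api_number: int, external_id: int) -> int:
    """Power-of-2 variant: shift the ID left, keep the service in the low bits."""
    if not 0 <= api_number <= SERVICE_MASK:
        raise ValueError("api_number does not fit in SERVICE_BITS")
    if external_id < 0:
        raise ValueError("external_id must be non-negative")
    return (external_id << SERVICE_BITS) | api_number


def unpack(packed: int) -> Tuple[int, int]:
    """Split a packed ID back into (api_number, external_id)."""
    return packed & SERVICE_MASK, packed >> SERVICE_BITS
```

Both mappings are deterministic, so the same source record always yields the same ID and can be round-tripped, for example `from_global_id(to_global_id(2, 5))` gives back `(2, 5)`. The block variant needs a safe guess for the largest external ID, while the bit-packed variant only needs a bound on the number of services.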