Elasticsearch _id as MD5 hash or document fields

Question

There are some examples available on the internet for customizing the _id field of an Elasticsearch document, but is there a way to generate a composite _id from multiple fields?

Sample Data

{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"...
}

How can I configure the ingest pipeline to generate the _id by joining the first 4 fields, which for this use case are considered the composite primary key?

Things to take care of:

There is a character limit on _id, and the join of the 4 fields can exceed it at any time.
A separator is needed so that 2 docs with different field values cannot end up with the same joined value.

I considered using a hashing algorithm like MD5 or SHA256, which can generate fixed-length _ids from "|".join(first,last,dob,phone), but I was not able to implement it in the ingest pipeline.
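
For reference, the hashing idea described above can be sketched client-side in Python. This is only an illustration under assumptions: the helper name `composite_id` and the backslash escaping scheme are hypothetical, and the value it produces is not expected to match what an Elasticsearch ingest processor would generate for the same document.

```python
import base64
import hashlib

KEY_FIELDS = ("first_name", "last_name", "dob", "phone")  # assumed composite key


def composite_id(doc, fields=KEY_FIELDS, sep="|"):
    """Return a fixed-length ID derived from the key fields (hypothetical helper)."""
    parts = []
    for name in fields:
        # str() so that 7894456123 and "7894456123" hash to the same ID.
        value = str(doc[name])
        # Escape the escape character first, then the separator, so that
        # ("a|b", "c") and ("a", "b|c") cannot produce the same joined string.
        value = value.replace("\\", "\\\\").replace(sep, "\\" + sep)
        parts.append(value)
    joined = sep.join(parts)
    digest = hashlib.md5(joined.encode("utf-8")).digest()  # 16 raw bytes
    # URL-safe Base64 without padding: 22 characters for an MD5 digest.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Escaping the separator inside each value is what keeps two different field combinations from collapsing into the same joined string before hashing.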

This is not a security concern, as we are only trying to define a primary key, and the indexes roll over on a monthly basis.

So a storage-efficient _id value is preferred.

If there are other ways to achieve this use case, please suggest them.

Val · Accepted Answer · 2021-11-05T08:49:26.017

4

Enter the fingerprint ingest processor (since ES 7.12.0).

You can define an ingest pipeline using that processor and set the _id field as you expect:

PUT _ingest/pipeline/id-fingerprint
{
  "processors": [
    {
      "fingerprint": {
        "fields": ["first_name", "last_name", "dob", "phone"],
        "target_field": "_id",
        "method": "MD5"
      }
    }
  ]
}
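
One way to check the pipeline before indexing anything is the simulate API (`POST _ingest/pipeline/id-fingerprint/_simulate`). The Python sketch below only builds and prints the request body; the helper name `simulate_body` is illustrative, and sending the request is left to whatever client is in use.

```python
import json


def simulate_body(*sources):
    """Wrap raw documents in the body shape the simulate API expects."""
    return {"docs": [{"_source": dict(src)} for src in sources]}


sample = {
    "first_name": "john",
    "last_name": "doe",
    "dob": "1987-12-21",
    "phone": "7894456123",
    "so": "on",
}

# Body for: POST _ingest/pipeline/id-fingerprint/_simulate
body = simulate_body(sample)
print(json.dumps(body, indent=2))
```

The simulated response should show the document as the pipeline would transform it, without writing anything to an index.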

Then, when you index your document, you can simply reference that pipeline:

PUT test/_doc/1?pipeline=id-fingerprint
{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"
}

Results =>

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "Xu28Onz3lbYCG0DrTTVp6Q==",      <--- the generated ID
  "_source" : {
    "phone" : "7894456123",
    "dob" : "1987-12-21",
    "last_name" : "doe",
    "so" : "on",
    "first_name" : "john"
  }
}
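
The generated value is a Base64-encoded digest, which bears on the storage question. A quick Python check of the sizes involved, using the `_id` from the result above; the SHA-256 figure is computed from the digest length alone, not taken from a pipeline run.

```python
import base64
import hashlib

generated_id = "Xu28Onz3lbYCG0DrTTVp6Q=="  # the _id from the result above

raw = base64.b64decode(generated_id)
md5_bytes = len(raw)           # raw MD5 digest size in bytes
md5_chars = len(generated_id)  # Base64 characters, padding included

# Length a Base64-encoded SHA-256 digest (32 raw bytes) would have.
sha256_chars = len(base64.b64encode(hashlib.sha256(b"").digest()))

print(md5_bytes, md5_chars, sha256_chars)  # 16 24 44
```

Either length is far below the 512-byte limit on `_id`, so the choice between MD5 and SHA-256 here is mostly about storage size.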

edited Nov 05 '21 at 08:49

answered Nov 05 '21 at 08:42


It does not handle the case where a field is indexed as an integer in one document and as a string with the same value in another. The index has a mapping that declares the field as an integer, but it still created 2 docs, one with the phone number as an integer and the other as a string. Do you know a fix for that? – Jugraj Singh Nov 05 '21 at 14:36
You should make sure to streamline your indexing process so that it always indexes the same way. – Val Nov 05 '21 at 14:39
Is that the only way? The issue is that the application aggregates data from multiple sources, and something like this could slip through if an indexer calls the ES APIs directly. Anyway, another reason to upgrade. – Jugraj Singh Nov 05 '21 at 14:56
I would add a few [`convert` processors](https://www.elastic.co/guide/en/elasticsearch/reference/current/convert-processor.html) above the `fingerprint` one, to make sure everything is transformed to the proper type before it reaches the fingerprinting. – Val Nov 05 '21 at 15:25
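
A sketch of what that suggestion could look like, written as a Python dict that serialises to the pipeline body. Converting all four key fields to `string` and setting `ignore_missing` are assumptions made for illustration; the comment above only says to put `convert` processors ahead of `fingerprint`.

```python
import json

KEY_FIELDS = ["first_name", "last_name", "dob", "phone"]


def build_pipeline(fields=KEY_FIELDS, method="MD5"):
    """Build a pipeline body: normalise key field types, then fingerprint."""
    # One convert processor per key field, so that 7894456123 and
    # "7894456123" reach the fingerprint step as the same string value.
    converts = [
        {"convert": {"field": name, "type": "string", "ignore_missing": True}}
        for name in fields
    ]
    fingerprint = {
        "fingerprint": {
            "fields": list(fields),
            "target_field": "_id",
            "method": method,
        }
    }
    # Order matters: processors run top to bottom.
    return {"processors": converts + [fingerprint]}


# Body for: PUT _ingest/pipeline/id-fingerprint
print(json.dumps(build_pipeline(), indent=2))
```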
