
Imagine I have an Elasticsearch instance with three kinds of data - author, publisher and book - all in JSON. Author data looks like this:

{
  "document-id": "XYZ",
  "document-type": "author",
  "name": "John Doe",
  "country": "Canada"
}

Publisher data looks like this:

{
  "document-id": "JKL",
  "document-type": "publisher",
  "name": "Random House"
}

And book data looks like this:

{
  "document-id": "ABC",
  "document-type": "book",
  "authorId": "XYZ",
  "publisherId": "JKL",
  "title": "Logstash for Dummies"
}

As of now, each goes into its own index.

I would like to create a denormalized version of the data, so that I can easily search for all books written by Canadian authors, or published by Random House. I need to support updates to the author, publisher and book data, so that if the author moves to a new country or changes their name, the denormalized copy will also be updated.

I also need to keep all fields from all objects in the denormalized copy - that is, avoid collisions between the two document-id fields, so that both document-id values are present even if one has to be renamed (and the same goes for publisher.name and author.name). All of this will be used in Kibana reports, and Kibana, as I understand it, doesn't have great support for nested objects - though it does seem to have some, which might eliminate my field-name-collision concerns.
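To make the collision handling concrete, here's a rough sketch in Python of the merge I have in mind (the renamed field names like author-name and publisher-document-id are placeholders of my own choosing, not anything Elasticsearch prescribes):

```python
def denormalize(book, author, publisher):
    """Merge author and publisher fields into a copy of the book document,
    prefixing their field names so nothing collides and no value is lost."""
    merged = dict(book)  # keeps book's own document-id, title, etc.
    for prefix, doc in (("author", author), ("publisher", publisher)):
        for field, value in doc.items():
            if field == "document-type":
                continue  # the merged document is its own type
            # e.g. author-name, author-document-id, publisher-name
            merged[f"{prefix}-{field}"] = value
    return merged

book = {"document-id": "ABC", "document-type": "book",
        "authorId": "XYZ", "publisherId": "JKL",
        "title": "Logstash for Dummies"}
author = {"document-id": "XYZ", "document-type": "author",
          "name": "John Doe", "country": "Canada"}
publisher = {"document-id": "JKL", "document-type": "publisher",
             "name": "Random House"}

merged = denormalize(book, author, publisher)
```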

What's the best way to achieve this? I've seen discussions pointing me towards the Logstash aggregate filter or the Elasticsearch output plugin, and I'm unsure which to pursue. Is Logstash even necessary, or is this possible with ingest pipelines?

Do both document types need to be in the same index in order for this to work? And should book be "enriched" with author and publisher data, or should they all be combined into yet a fourth document type?

I'm an Elasticsearch novice, and a complete newcomer to Logstash, so I'd appreciate any guidance you can provide.

Thanks!

(Cross-posted from https://discuss.elastic.co/t/enrich-one-document-with-fields-from-another/208651, after not receiving a response there after five days.)

  • Can you modify your sample documents to also include the publisher data? There's no single/simple answer to your question, but I'll give it a shot... – Val Nov 25 '19 at 08:43
  • Updated. Thanks in advance for your efforts to help! – rweiser Nov 25 '19 at 09:54

1 Answer


The best way is indeed to try to denormalize all your data into "book" documents. However, it's not always ideal, especially if the merged data (author, publisher) can change over time: in that case you'd need to update all related book documents (usually with the update by query API), and potentially rebuild the full index every time an author or publisher changes.
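For example, if an author moved from Canada to France, such an update might look like the following sketch (it assumes the denormalized book documents carry an author-country field, which is a naming choice of mine, not something from your mapping):

POST books/_update_by_query
{
  "query": {
    "term": { "authorId": "XYZ" }
  },
  "script": {
    "source": "ctx._source['author-country'] = params.country",
    "params": { "country": "France" }
  }
}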

As I said, there's neither a simple answer nor a right or wrong one. It always depends on the use cases you want to support.

In your case, you can get away with doing two queries: one to fetch IDs from the author and publisher indexes, then a second that uses those IDs to query the book index.

For instance, if you need to retrieve all books by Canadian authors published by Random House, you'd first issue the following query:

POST _msearch
{ "index" : "authors" }
{ "query" : { "term": { "country": "Canada" }}}
{ "index" : "publishers" }
{ "query" : { "term": { "name": "Random House" }}}

Then, using the IDs of the retrieved documents, you can query the book index:

POST books/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "authorId": ["XYZ1", "XYZ2", "XYZ3"]
          }
        },
        {
          "terms": {
            "publisherId": ["ABC1"]
          }
        }
      ]
    }
  }
}

In your case, I would clearly not go the parent/child or nested route; there'd be too much overhead keeping everything in sync.

  • Thanks, Val. But given that the ultimate goal here is Kibana visualizations / reports, is that two-step process still relevant? What if I wanted to create a data table of the average price of books by Canadian authors, aggregated by publisher (i.e., each row in the table represents one publisher, and contains the average price of all of that publisher's books by Canadian authors)? – rweiser Nov 25 '19 at 10:18
  • And why would I "potentially have to rebuild the full index every time an author or publisher changes"? – rweiser Nov 25 '19 at 10:19
  • Yes, indeed, that's not ideal for the Kibana use case. Then you need to denormalize everything and store all the necessary author/publisher data in the book documents. The "rebuild everything" remark was meant more for the case where you miss some updates for whatever reason (and believe me, that will happen) – Val Nov 25 '19 at 10:21
  • Thanks. And - back to the original question - is there any way Logstash or ingest pipelines help me with this? So that whenever a new book is loaded, it is automatically enriched with author and publisher data? Or am I better off just enhancing my (Java-based) book-loading code to fetch and add the author and publisher on its own? – rweiser Nov 25 '19 at 10:26
  • Yes, Logstash can help using the [`jdbc_static` and `jdbc_streaming` filters](https://stackoverflow.com/a/58832270/4604579). However, this only works for new books coming in; it won't help when author/publisher data changes and you need to update all existing book records already stored in the books index. Unless.... you also flag all impacted books in your DB so that they get picked up again by Logstash for reindexing, in which case you also need to choose the expiration strategy in the `jdbc_static` and `jdbc_streaming` filters cleverly. – Val Nov 25 '19 at 10:33
  • Do those plugins work with MongoDB (which is the source of the book data and author data; the publisher data comes from Postgres)? – rweiser Nov 25 '19 at 10:41
  • Yes, using a compatible MongoDB JDBC driver – Val Nov 25 '19 at 10:43
  • Also see my answer to your [other question](https://stackoverflow.com/a/59047019/4604579) which could pretty much work in this case, too. – Val Nov 26 '19 at 08:53
  • I haven't had a chance to try it yet, and we may go with a different approach altogether. In any case, thanks for your help on this issue! – rweiser Nov 26 '19 at 19:46
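For reference, a minimal sketch of the `jdbc_streaming` enrichment discussed in the comments, looking up the Postgres-backed publisher data for each incoming book event (the connection settings, table name, and column names here are hypothetical):

filter {
  jdbc_streaming {
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/catalog"
    jdbc_user => "logstash"
    statement => "SELECT name FROM publishers WHERE id = :pid"
    parameters => { "pid" => "publisherId" }
    target => "publisher"
  }
}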