Merging data from different sources at index time

Question

I have two file crawler jobs running separately, using fscrawler (https://github.com/dadoonet/fscrawler), on data sets that are related to each other. I now want to merge the data at index time (a parent-child relation or a flat document would both be fine), so some kind of middleware is needed. I looked at both Logstash and the new Ingest Node feature in ES 5.0, but neither seems to support writing custom processors.

Are there any possibilities to do this sort of merging/relational mapping at index time? Or do I have to do post-processing instead?

EDIT: One job crawls "articles" in JSON format. An article can have multiple attachments (declared in an attachment array in the JSON), which are stored in a different location. The second job crawls the actual attachments (e.g. PDFs) and applies Tika processing to them. In the end I would like to have a single article type that also contains the content of its attachments.

Can you describe in a bit more detail what kind of data your two crawlers are sending and what you'd like to obtain in the end? – Val, Oct 14 '16 at 13:18
This doesn't really sound like a question about Elasticsearch. It sounds like you just need a strategy for fetching data from two sources and constructing documents out of it... – Ant P, Oct 14 '16 at 13:43
@Ant P Perhaps, but since Ingest Node preprocesses data before indexing, I thought this would be a similar issue. – frods, Oct 14 '16 at 13:53

score 1 · Accepted Answer · answered Oct 16 '16 at 20:31

If you loaded both sets of documents into different Elasticsearch indexes, you could have a Logstash input that looks for articles that don't (yet) contain the content of their attachments. For those documents, you could query the other Elasticsearch index (see the elasticsearch{} filter in Logstash) and update the article document.
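As a rough illustration of the merge step only, here is a minimal Python sketch. It is not Logstash configuration and not necessarily how the pipeline above would be written. The field names (`attachments`, `content`, `attachment_content`, `attachments_pending`) and the helper `merge_attachments` are assumptions made up for the example, and the in-memory dict stands in for a lookup against the attachment index.

```python
from typing import Dict, List


def merge_attachments(article: dict, attachments_by_name: Dict[str, dict]) -> dict:
    """Return a copy of `article` with the text of its attachments folded in.

    `attachments_by_name` stands in for a lookup against the attachment index
    (hypothetical: keyed by file name, each value holding a `content` field).
    """
    merged = dict(article)  # shallow copy, so the input article is not modified
    contents: List[str] = []
    pending: List[str] = []
    for name in article.get("attachments", []):
        doc = attachments_by_name.get(name)
        if doc is None:
            # Attachment not crawled yet; remember it so a later run can retry.
            pending.append(name)
        else:
            contents.append(doc.get("content", ""))
    merged["attachment_content"] = contents
    merged["attachments_pending"] = pending
    return merged


# Made-up sample data for illustration.
articles = [{"id": "a1", "title": "Report", "attachments": ["r.pdf", "x.pdf"]}]
attachments = {"r.pdf": {"content": "extracted text"}}

merged_articles = [merge_attachments(a, attachments) for a in articles]
```

In a real setup the merged fields would presumably be written back to the article index as an update, and articles with a non-empty `attachments_pending` list would be picked up again on the next run.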

Alain Collins

16,268
2
32
55

Yes, it looks like post-processing is necessary. I did not know about the elasticsearch filter, thanks! Marked as answer. – frods Oct 17 '16 at 06:34
