
We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.

The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching. For that reason we set _source to FALSE.

I do, however, want the 5 ident fields returned. Is it possible to set store = yes on the ident fields while leaving the overall index _source set to FALSE, and still get those fields back in the results?

oucil

2 Answers


Have a look at this other answer as well. As mentioned there, in most cases the _source field helps a lot. It might seem like a waste, because Elasticsearch effectively stores the whole incoming document, but it is really handy (e.g. when you need to update documents without sending the whole updated document). It also hides a Lucene implementation detail: you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, _source helps performance-wise too, as it requires a single disk seek instead of the multiple seeks that retrieving several stored fields might cause. Ultimately, the _source field is just one big Lucene stored field containing JSON, which can be parsed to get at specific fields and work with them, without needing to store them separately.

That said, depending on your use case (how many fields you retrieve), it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts of your documents (e.g. the sensitive parts) from being stored in the _source field. That is useful if you want to keep relying on _source but don't want part of the input documents to be returned, while still searching against those fields, as they are going to be indexed (but not stored!) in the underlying Lucene index.
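As a rough sketch of the exclude approach described above, the snippet below builds the `_source` fragment of a mapping as a plain Python dict. The field names are hypothetical, and where this fragment sits inside the full mapping body (under a type name or directly under `mappings`) depends on your Elasticsearch version, so check the _source field reference for the release you run.

```python
import json


def mapping_with_source_excludes(sensitive_fields):
    """Build a mapping fragment that keeps _source but omits some fields from it."""
    return {
        "_source": {
            # Fields listed here are left out of the stored _source JSON.
            # They are still indexed, so they remain searchable.
            "excludes": list(sensitive_fields)
        }
    }


# Hypothetical field names, for illustration only.
mapping = mapping_with_source_excludes(["generic_1", "generic_2"])
print(json.dumps(mapping, indent=2))
```

With this fragment in place, hits would still carry a `_source`, just without the excluded fields.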

In both cases (whether you disable _source completely or exclude some parts), if you plan to update your documents, keep in mind that you'll need to send the whole updated document using the index API. You cannot rely on the partial updates provided by the update API, because the index no longer holds the complete document you originally indexed, which is what the changes would need to be applied to.

javanna
  • Very helpful information, particularly relating to performance! This will start off as a small index while we ramp it up, so rebuilding the index won't be too big a hit. If we can get the performance gains of _source while using exclusions rather than explicit inclusions, that would make more sense. Thanks! – oucil Sep 17 '13 at 17:42
  • No worries, just run some tests and see what fits your use case better! – javanna Sep 17 '13 at 17:43

Yes. Stored fields do not rely on the _source field, nor vice versa. They are separate, and changing or disabling one shouldn't impact the other.
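To make that concrete, here is a minimal sketch that builds a mapping with `_source` disabled and only the ident fields stored, plus a search body that asks for those stored fields. All field names are hypothetical. The syntax assumed here is that of recent Elasticsearch releases (`"store": true`, `text`/`keyword` types, `stored_fields` in the search body); older releases used `"store": "yes"`, the `string` type, and a `fields` key in the search request instead.

```python
import json

# Hypothetical field names, for illustration only.
IDENT_FIELDS = ["resource_type", "resource_id"]
GENERIC_FIELDS = ["title", "body"]


def build_mapping(ident_fields, generic_fields):
    """Mapping with _source disabled and only the ident fields stored."""
    properties = {}
    for name in generic_fields:
        # Indexed for matching only; "store" is left at its default (not stored).
        properties[name] = {"type": "text"}
    for name in ident_fields:
        # Stored separately so the value can be returned with each hit.
        properties[name] = {"type": "keyword", "store": True}
    return {"_source": {"enabled": False}, "properties": properties}


def build_search(query_text, ident_fields, generic_fields):
    """Search body that matches on generic fields and returns stored ident fields."""
    return {
        "query": {
            "multi_match": {"query": query_text, "fields": list(generic_fields)}
        },
        "stored_fields": list(ident_fields),
    }


mapping = build_mapping(IDENT_FIELDS, GENERIC_FIELDS)
search = build_search("quarterly report", IDENT_FIELDS, GENERIC_FIELDS)
print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

The two dicts would be sent as the bodies of the index-creation and search requests respectively; how they are sent (HTTP client, official client library) is left out on purpose.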

femtoRgon
  • I figured, but wanted to confirm. The documentation didn't seem very clear on that distinction. Thanks for chiming in. – oucil Sep 17 '13 at 00:08