Have a look at this other answer as well. As mentioned there, in most of the cases the _source
field helps a lot. Even though it might seem like a waste because elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a lucene implementation detail, the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source
helps performance wise too, as it requires a single disk seek instead of more disk seeks that might be caused by retrieving multiple stored fields. At the end of the day the _source
field is just a big lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your usecase (how many fields you retrieve) it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts (e.g. the sensitive parts of your documents) of the source field from being stored. That would be useful if you want to keep relying on the _source
but don't want a part of the input documents to be returned, but you do want to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
In both cases (either you disable the _source
completely or exclude some parts), if you plan to update your documents keep in mind that you'll need to send the whole updated document using the index api. In fact you cannot rely on partial updates provided with the update api as you don't have in the index the complete document that you indexed in the first place, which you would need to apply changes to.