The Elasticsearch docs say the following about rewrite time (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html#rewrite-section):
The rewriting process is complex and difficult to display, since queries can change drastically. Rather than showing the intermediate results, the total rewrite time is simply displayed as a value (in nanoseconds). This value is cumulative and contains the total time for all queries being rewritten.
I am using a has_child query and it's slow. The docs mention that it is slow, but I want to figure out why!
Elasticsearch 7 mapping:
The form_entries are "double" joined. We're going to query form, so there is only one level of has_child.
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "pseudo_id": {
        "type": "keyword"
      },
      "form": {
        "dynamic": "strict",
        "properties": {
          "id": {
            "type": "keyword",
            "eager_global_ordinals": true
          },
          "start_date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_second"
          }
        }
      },
      "form_entries": {
        "dynamic": "strict",
        "properties": {
          "id": {
            "type": "keyword",
            "eager_global_ordinals": true
          },
          "form_id": {
            "type": "keyword",
            "eager_global_ordinals": true
          },
          "start_date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_second"
          }
        }
      },
      "patient_joins": {
        "type": "join",
        "eager_global_ordinals": true,
        "relations": {
          "_doc": [
            "form"
          ],
          "form": "form_entries"
        }
      }
    }
  }
}
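To make the "double" join concrete, documents at the three levels would look roughly like the sketch below. The ids and dates are made up, join_value is a hypothetical helper (not part of any client library), and I am assuming the root document's _id is the pseudo id, the form document's _id is the form id, and all three documents are indexed with the pseudo id as routing.

```python
def join_value(name, parent=None):
    """Build the value of the patient_joins join field for one document."""
    value = {"name": name}
    if parent is not None:
        value["parent"] = parent  # _id of the parent document (assumed scheme)
    return value


PSEUDO_ID = "patient-1"  # made-up id, also assumed to be the routing value

# Level 1: root document (relation "_doc"), has no parent.
root = {"pseudo_id": PSEUDO_ID, "patient_joins": join_value("_doc")}

# Level 2: a form, child of the root document.
form = {
    "pseudo_id": PSEUDO_ID,
    "form": {"id": "form-1", "start_date": "2020-01-01T00:00:00Z"},
    "patient_joins": join_value("form", parent=PSEUDO_ID),
}

# Level 3: a form entry, child of the form.
entry = {
    "pseudo_id": PSEUDO_ID,
    "form_entries": {
        "id": "entry-1",
        "form_id": "form-1",
        "start_date": "2020-01-02T00:00:00Z",
    },
    "patient_joins": join_value("form_entries", parent="form-1"),
}
```

The has_child query only has to bridge level 3 to level 2.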
Index stats:
forms: 14 million
form_entries: 200 million
Query (with profiling enabled):
{
  "index": "main",
  "size": 20,
  "routing": "<pseudo id>",
  "body": {
    "profile": true,
    "query": {
      "constant_score": {
        "filter": {
          "bool": {
            "must": [
              {
                "term": {
                  "pseudo_id": "<pseudo id>"
                }
              },
              {
                "exists": {
                  "field": "form.id"
                }
              },
              {
                "has_child": {
                  "type": "form_entries",
                  "query": {
                    "match_all": {} // <--- Note: We're not even filtering
                  },
                  "min_children": 1,
                  "inner_hits": {
                    "size": 1
                  }
                }
              }
            ]
          }
        }
      }
    },
    "sort": [
      "pseudo_id",
      {
        "form.start_date": {
          "order": "desc"
        }
      },
      {
        "form.id": {
          "order": "asc"
        }
      }
    ],
    "_source": true
  }
}
Relevant profile outcomes:
"profile": {
  "shards": [
    {
      "searches": [
        {
          "query": [
            {
              "type": "ConstantScoreQuery",
              "description": "ConstantScore(+pseudo_id:<pseudo id> +ConstantScore(DocValuesFieldExistsQuery [field=form.id]) +(+form.description.raw:consult +GlobalOrdinalsQuery{joinField=patient_joins#form}))",
              "time_in_nanos": 655400,
              "breakdown": {
                ...
          ],
          "rewrite_time": 378342100,
          "collector": [
            {
              "name": "SimpleFieldCollector",
              "reason": "search_top_hits",
              "time_in_nanos": 279800
            }
          ]
        }
Query time: <1ms
Rewrite time: ~378ms (?)
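For reference, the two numbers can be pulled out of a profile response with a small helper like the one below. This is only a sketch: summarize_profile is a hypothetical name, the sample dict mirrors the trimmed profile output above, and the shard id is a placeholder.

```python
def summarize_profile(response):
    """Return one (shard_id, query_ms, rewrite_ms) tuple per profiled search."""
    rows = []
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            # Top-level query nodes already include the time of their children.
            query_ns = sum(q["time_in_nanos"] for q in search.get("query", []))
            rewrite_ns = search.get("rewrite_time", 0)
            rows.append((shard.get("id"), query_ns / 1e6, rewrite_ns / 1e6))
    return rows


# Shaped like the trimmed profile output above; the shard id is a placeholder.
sample = {
    "profile": {
        "shards": [
            {
                "id": "<shard id>",
                "searches": [
                    {
                        "query": [
                            {"type": "ConstantScoreQuery", "time_in_nanos": 655400}
                        ],
                        "rewrite_time": 378342100,
                    }
                ],
            }
        ]
    }
}

for shard_id, query_ms, rewrite_ms in summarize_profile(sample):
    print(f"{shard_id}: query={query_ms:.2f}ms rewrite={rewrite_ms:.2f}ms")
```

With the values from the profile above, this reports 0.66ms for the query and 378.34ms for the rewrite.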
Question:
So what does Lucene need to rewrite for has_child, and why? Can additional profile options be used? Can the rewrite be disabled?
Semi-related: If we reduce the data set to 50K forms, the query time remains the same, but the rewrite is much faster.