
I have indexed 2 million documents and I want to return all the matching document ids in a single request. I am using the PHP client.

My index settings and mapping are as follows:

$params = [
    'index' => $index,
    'body' => [
        'settings' => [
            "number_of_shards" => 1,
            "number_of_replicas" => 0,
            "index.queries.cache.enabled" => false,
            "index.soft_deletes.enabled" => false,
            "index.refresh_interval" => -1,
            "index.requests.cache.enable" => false,
            "index.max_result_window" => $result_window
        ],
        'mappings' => [
            '_source' => [
                "enabled" => false
            ],
            'properties' => [
                "text" => [
                    "type" => "text",
                    "index_options" => "docs"
                ]
            ]
        ]
    ]
];

My query string is as follows:

$json = '{
    "from": 0,
    "size": '.$size.',
    "profile": true,
    "query": {
        "bool": {
            "filter": {
                "match": {
                    "text": {
                        "query": "justin trump clinton harry",
                        "operator": "and"
                    }
                }
            }
        }
    }
}';

My profile API output is as follows:

Array                                                                                                                                  
(                                                                                                                                 
    [shards] => Array                                                                                                
        (                                                                                                               
            [0] => Array                                                                                                    
                (                                                                                                   
                    [id] => [tod2gbVKSRGinZVfdXTmxA][elasticindex-2][0]                                 
                    [searches] => Array                                                                          
                        (                                                                                     
                            [0] => Array                                                                                  
                                (                                                                                   
                                    [query] => Array                                                             
                                        (                                                                            
                                            [0] => Array                                                       
                                                (                                                                    
                                                    [type] => BoostQuery                                            
                                                    [description] => (ConstantScore(+text:justin +text:trump +text:clinton +text:harry))^0.0
                                                    [time_in_nanos] => 176108294                                                  
                                                    [breakdown] => Array                                           
                                                        (                                                               
                                                            [set_min_competitive_score_count] => 0                          
                                                            [match_count] => 0                                    
                                                            [shallow_advance_count] => 0                
                                                            [set_min_competitive_score] => 0                          
                                                            [next_doc] => 158666901                           
                                                            [match] => 0                                                  
                                                            [next_doc_count] => 439522                              
                                                            [score_count] => 439522                              
                                                            [compute_max_score_count] => 0                           
                                                            [compute_max_score] => 0                            
                                                            [advance] => 262234                                                   
                                                            [advance_count] => 1                                    
                                                            [score] => 14477781                                         
                                                            [build_scorer_count] => 2                                             
                                                            [create_weight] => 401058                              
                                                            [shallow_advance] => 0                                      
                                                            [create_weight_count] => 1                                      
                                                            [build_scorer] => 1421272                         
                                                        )                                                                 

                                                    [children] => Array                                          
                                                        (                                                                 
                                                            [0] => Array
                                                                (
                                                                    [type] => BooleanQuery
                                                                    [description] => +text:justin +text:trump +text:clinton +text:harry
                                                                    [time_in_nanos] => 128547273
                                                                    [breakdown] => Array
                                                                        (
                                                                            [set_min_competitive_score_count] => 0
                                                                            [match_count] => 0
                                                                            [shallow_advance_count] => 0
                                                                            [set_min_competitive_score] => 0
                                                                            [next_doc] => 126071813
                                                                            [match] => 0
                                                                            [next_doc_count] => 439522
                                                                            [score_count] => 0
                                                                            [compute_max_score_count] => 0
                                                                            [compute_max_score] => 0
                                                                            [advance] => 260695
                                                                            [advance_count] => 1
                                                                            [score] => 0
                                                                            [build_scorer_count] => 2
                                                                            [create_weight] => 373620
                                                                            [shallow_advance] => 0
                                                                            [create_weight_count] => 1
                                                                            [build_scorer] => 1401619
                                                                        )

                                                                    [children] => Array
                                                                        (
                                                                            [0] => Array
                                                                                (
                                                                                    [type] => TermQuery
                                                                                    [description] => text:justin
                                                                                    [time_in_nanos] => 40691947

                                                                                )

                                                                            [1] => Array
                                                                                (
                                                                                    [type] => TermQuery
                                                                                    [description] => text:trump
                                                                                    [time_in_nanos] => 42972729
                                                                                )

                                                                            [2] => Array
                                                                                (
                                                                                    [type] => TermQuery
                                                                                    [description] => text:clinton
                                                                                    [time_in_nanos] => 29407195

                                                                                )

                                                                            [3] => Array
                                                                                (
                                                                                    [type] => TermQuery
                                                                                    [description] => text:harry
                                                                                    [time_in_nanos] => 33799904

                                                                                )

                                                                        )

                                                                )

                                                        )

                                                )

                                        )

                                    [rewrite_time] => 260704
                                    [collector] => Array
                                        (
                                            [0] => Array
                                                (
                                                    [name] => SimpleTopScoreDocCollector
                                                    [reason] => search_top_hits
                                                    [time_in_nanos] => 116380511
                                                )
                                        )

                                )

                        )

                )

        )

)

The goal is to get all the matching documents at once. I only need the document ids (that is, I only need to check whether the given terms exist in a document or not), so I set index_options to docs. I know about the scroll API, but I want to use max_result_window instead. I am using a single shard with no replicas, and I also avoid scoring documents when I run the search.
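To show what I mean by "only document ids", here is a minimal sketch of how I pull the ids out of a search response on the PHP side. The sample response is hypothetical (the ids and the hit count are made up), and extractDocumentIds is just an illustrative helper name, not part of the PHP client. It only assumes the usual hits.hits layout of a search response.

```php
<?php
// Hypothetical sample shaped like a search response; the ids and counts
// below are made up for illustration, not taken from my index.
$response = [
    'hits' => [
        'total' => ['value' => 2, 'relation' => 'eq'],
        'hits'  => [
            ['_index' => 'elasticindex-2', '_type' => '_doc', '_id' => '17', '_score' => 0.0],
            ['_index' => 'elasticindex-2', '_type' => '_doc', '_id' => '42', '_score' => 0.0],
        ],
    ],
];

// Illustrative helper: keep only the _id of every hit and drop the rest
// of the per-hit metadata (_index, _type, _score).
function extractDocumentIds(array $response): array
{
    return array_column($response['hits']['hits'] ?? [], '_id');
}

$ids = extractDocumentIds($response);
print_r($ids);
```

This only trims the response on the client side. The index name and type are still sent back for every hit, which is what question 1 below is about.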

My questions are as follows:

  1. I want to retrieve only document ids and avoid the document fetch phase, so I disabled the _source field. To avoid the other metadata, I tried "stored_fields": "_none_" and "docvalue_fields": ["_id"] as per this link. But I can still see the document type and index name in each hit. Is there anything else I need to do to get only document ids and avoid the fetch phase?

  2. Since I am retrieving all the matching documents, scoring is irrelevant to me, so I used a filter clause. Why am I still getting a BoostQuery timing (including a non-zero score time) in the profile API results above? Note that the BooleanQuery score timing is zero.

  3. To know how much time the Boolean query took on the Lucene index alone, should I just take the time reported for the BooleanQuery, or do I need to add up the timings of all its children (the TermQuery entries)? When I add up the four TermQuery timings, the total is higher than the time reported for the BooleanQuery. What could be the reason for this?

  4. Do I need to include the collector time as well in my Boolean query timing? The profile API documentation says that "Lucene works by defining a "Collector" which is responsible for coordinating the traversal, scoring, and collection of matching documents." It also says: "It should be noted that Collector times are independent from the Query times. They are calculated, combined, and normalized independently! Due to the nature of Lucene’s execution, it is impossible to "merge" the times from the Collectors into the Query section, so they are displayed in separate portions". As I understand it, the collector helps traverse the postings lists of the Lucene index to execute the Boolean query. Am I right in this regard?

  5. Is there a similar API for investigating indexing time in Elasticsearch? I was able to get the indexing time from the settings API, but I am looking for something comparable to the profile API.
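
To make question 3 concrete, here is a small PHP sketch that adds up the four TermQuery timings and compares the total with the BooleanQuery time. All numbers are copied from the profile output above; the variable names are only for illustration.

```php
<?php
// time_in_nanos of the BooleanQuery, copied from the profile output above.
$booleanQueryNanos = 128547273;

// time_in_nanos of each child TermQuery, copied from the same output.
$termQueryNanos = [
    'text:justin'  => 40691947,
    'text:trump'   => 42972729,
    'text:clinton' => 29407195,
    'text:harry'   => 33799904,
];

// Sum of the children and how far it exceeds the parent's reported time.
$childrenTotal = array_sum($termQueryNanos);
$difference    = $childrenTotal - $booleanQueryNanos;

printf("BooleanQuery:     %d ns (%.2f ms)\n", $booleanQueryNanos, $booleanQueryNanos / 1e6);
printf("Sum of TermQuery: %d ns (%.2f ms)\n", $childrenTotal, $childrenTotal / 1e6);
printf("Difference:       %d ns (%.2f ms)\n", $difference, $difference / 1e6);
```

The children add up to 146871775 ns (about 146.87 ms), while the BooleanQuery reports 128547273 ns (about 128.55 ms), so the sum of the children is about 18.32 ms higher than the parent.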
