
I have a custom class indexed in ES 2.5 with the following fields:

Title
DataSources
Content

Searching works fine except on the middle field, DataSources, which is indexed as a single string delimited by '|'.

ex: "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"

I need to build a query that matches the search words across all fields AND matches at least one number in the DataSources field.

So to summarize what I currently have:

    QueryBase query = new SimpleQueryStringQuery
    {
        //DefaultOperator = !operatorOR ? Operator.And : Operator.Or,
        Fields = LearnAboutFields.FULLTEXT,
        Analyzer = "standard",
        Query = searchWords.ToLower()
    };
    _boolQuery.Must = new QueryContainer[] {query};

That's the search words query.

    foreach (var datasource in dataSources)
    {
        // Add DataSources with an OR
        queryContainer |= new WildcardQuery { Field = LearnAboutFields.DATASOURCE, Value = string.Format("*{0}*", datasource) };
    }
    // Add this Boolean Clause to our outer clause with an AND
    _boolQuery.Filter = new QueryContainer[] {queryContainer};

That's for the datasources query. There can be multiple datasources.

It doesn't work: it returns no results once the filter query is added. I think the tokenizer/analyzer needs some work, but I don't know enough about ES to figure that out.

EDIT: Per Val's comments below I have attempted to recode the indexer like this:

        _elasticClientWrapper.CreateIndex(_DataSource, i => i
            .Mappings(ms => ms
                .Map<LearnAboutContent>(m => m
                    .Properties(p => p
                        .String(s => s.Name(lac => lac.DataSources)
                            .Analyzer("classic_tokenizer")
                            .SearchAnalyzer("standard")))))
            .Settings(s => s
                .Analysis(an => an.Analyzers(a => a.Custom("classic_tokenizer", ca => ca.Tokenizer("classic"))))));
        var indexResponse = _elasticClientWrapper.IndexMany(contentList);

It builds successfully, with data. However the query still isn't working right.

New query for DataSources:

        foreach (var datasource in dataSources)
        {
            // Add DataSources with an OR
            queryContainer |= new TermQuery {Field = LearnAboutFields.DATASOURCE, Value = datasource};
        }
        // Add this Boolean Clause to our outer clause with an AND
        _boolQuery.Must = new QueryContainer[] {queryContainer};

And the JSON:

{
  "learnabout_index": {
    "aliases": {},
    "mappings": {
      "learnaboutcontent": {
        "properties": {
          "articleID": { "type": "string" },
          "content": { "type": "string" },
          "dataSources": {
            "type": "string",
            "analyzer": "classic_tokenizer",
            "search_analyzer": "standard"
          },
          "description": { "type": "string" },
          "fileName": { "type": "string" },
          "keywords": { "type": "string" },
          "linkURL": { "type": "string" },
          "title": { "type": "string" }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1483992041623",
        "analysis": {
          "analyzer": {
            "classic_tokenizer": { "type": "custom", "tokenizer": "classic" }
          }
        },
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "iZakEjBlRiGfNvaFn-yG-w",
        "version": { "created": "2040099" }
      }
    },
    "warmers": {}
  }
}

The Query JSON request:

{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "_all"
            ],
            "query": "\"housing\"",
            "analyzer": "standard"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "DataSources": [
              "1"
            ]
          }
        }
      ]
    }
  }
}
Val
Michael

2 Answers


One way to achieve this is to create a custom analyzer with a classic tokenizer which will break your DataSources field into the numbers composing it, i.e. it will tokenize the field on each | character.

So when you create your index, you need to add this custom analyzer and then use it in your DataSources field:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "number_analyzer": {
          "type": "custom",
          "tokenizer": "number_tokenizer"
        }
      },
      "tokenizer": {
        "number_tokenizer": {
          "type": "classic"
        }
      }
    }
  },
  "mappings": { 
    "my_type": {
      "properties": {
        "DataSources": {
          "type": "string",
          "analyzer": "number_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

As a result, if you index the string "|4|7|8|9|10|12|14|19|20|21|22|23|29|30", your DataSources field will effectively contain the following array of tokens: [4, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 23, 29, 30]
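
One way to check what the analyzer actually produces is the `_analyze` API. This is a minimal sketch, assuming the index was created as `my_index` with the `number_analyzer` defined above, and that the cluster is a 2.x version that accepts a JSON body on this endpoint. The sample text is a shortened version of the string from the question.

```
GET my_index/_analyze
{
  "analyzer": "number_analyzer",
  "text": "|4|7|8|9|10"
}
```

The response lists each token that was emitted, which should make it easy to confirm whether the field is being split into individual numbers.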

Then you can get rid of your WildcardQuery and simply use a TermsQuery instead:

// Match documents whose DataSources field contains at least one of the given values
var terms = new TermsQuery { Field = LearnAboutFields.DATASOURCE, Terms = dataSources };
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] { terms };
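
For comparison, a terms filter like the one above would be expected to serialize to roughly the following request. This is a sketch against the asker's `learnabout_index`, whose mapping names the field `dataSources`, and the values `4` and `7` are made up for illustration. Field names are case-sensitive, so the name in the query has to match the name in the mapping exactly. NEST camel-cases property names by default when it builds a mapping from a POCO, so a field name passed as a plain string needs the same casing.

```
POST learnabout_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "dataSources": ["4", "7"] } }
      ]
    }
  }
}
```

Comparing the request JSON with the output of `GET learnabout_index` is a quick way to spot a field-name mismatch.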
Val
  • Can you provide the NEST equivalent of the JSON you specified above? And is there any way to just annotate the custom class? – Michael Jan 08 '17 at 07:16
  • You can find a sample NEST equivalent here: http://stackoverflow.com/questions/25193800/creating-a-custom-analyzer-in-elasticsearch-nest-client/25219676#25219676 – Val Jan 08 '17 at 07:19
  • Thanks Val, the NEST example appears to be out of date. Is there a more current version (2.5 at least)? – Michael Jan 09 '17 at 16:42
  • Nice, can you run this and edit your question with the results: `curl -XGET localhost:9200/index_name` – Val Jan 09 '17 at 19:04
  • Edited as requested. – Michael Jan 09 '17 at 19:08
  • You also need to set the search_analyzer on your `DataSources` using `SearchAnalyzer("standard")` – Val Jan 09 '17 at 19:15
  • Did you make sure to delete your index and repopulate it? – Val Jan 09 '17 at 21:26
  • Yes of course. Just updated the JSON in post to reflect changes. – Michael Jan 10 '17 at 00:34
  • Ok everything works ok on my side for a few OR'ed `term` queries (even though you should use a `terms` one like in my answer), now can you show the generated JSON query that you get? – Val Jan 10 '17 at 04:06
  • Updated the code to use terms instead - no dice. I updated the post with the request JSON - maybe something wrong with my main keyword part? – Michael Jan 10 '17 at 05:07
  • Oh wait, the mapping has a field `dataSources` and in your query you have `DataSources`, that's probably the issue. – Val Jan 10 '17 at 05:19
  • dataSources is a string list of search objects. The field in the POCO object is DataSources. That looks correct to me... – Michael Jan 10 '17 at 05:24
  • No, I'm not talking about the `dataSources` variable in your code, but the field you use in your query (i.e. `DataSources`) must be named exactly the same way as the field you have in your mapping (i.e. `dataSources`). Compare the JSON query with your JSON mapping and you'll see what I'm talking about. – Val Jan 10 '17 at 05:26
  • That did it! Thanks for all your help! – Michael Jan 10 '17 at 05:37
  • Hey. Just asked a similar question (this answer didn't help) that maybe you can help with? http://stackoverflow.com/questions/42546040/custom-tab-tokenizer-in-elasticsearch-nest-2-4 – Michael Mar 02 '17 at 03:14
  • I'll check but feel free to tell Russ Cam that you need more info. – Val Mar 02 '17 at 05:05

At an initial glance at your code I think one problem you might have is that any queries placed within a filter clause will not be analysed. So basically the value will not be broken down into tokens and will be compared in its entirety.

It's easy to forget this so any values that require analysis need to be placed in the must or should clauses.

GWilkinson
  • I was afraid of that, but I know that even if I switch it back over to being a part of the query it wouldn't matter - still doesn't work. – Michael Jan 05 '17 at 12:37