1

I am trying to match a query based on URL fields. I have an my InsertLink method below that is fired off when someone adds a new link on the webpage. Right now if any link is to be added in with the prefix "https://" or "http://" it automatically matches the first (and in this case only) item with the https:// or "http://" prefix in the index. Is this because of the way my model is setup with the Uri type? Here is an example of my model and a screenshot of a debug for the InsertLink method.

My Model:

public class SSOLink
{
    public string Name { get; set; }
    public Uri Url { get; set; }
    public string Owner { get; set; }

}

An example of the screenshot.

ElasticSearch Debug

Parakoopa
  • 505
  • 1
  • 9
  • 22

1 Answers1

2

You need to use the UAX_URL tokenizer to search on URL fields.

You can create your custom analyzer using UAX_URL token as and use the same match query which you are using as of now to get the expected result.

Index mapping

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "uax_url_email",
                    "max_token_length": 5
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "url": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}

Looks like in your case, the URL field is using the text field in Elasticsearch, which uses the standard analyzer and using _analyze API, you can check the tokens generated by your URL field.

Using standard analyzer

POST _analyze/

{
    "text": "https://www.microsoft.com",
    "analyzer" : "standard"
}

Tokens

{
    "tokens": [
        {
            "token": "https",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "www.microsoft.com",
            "start_offset": 8,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Using UAX_URL tokenizer

{
    "text": "https://www.microsoft.com",
    "tokenizer" : "uax_url_email"
}

And generated tokens

{
    "tokens": [
        {
            "token": "https://www.microsoft.com",
            "start_offset": 0,
            "end_offset": 25,
            "type": "<URL>",
            "position": 0
        }
    ]
}
Amit
  • 30,756
  • 6
  • 57
  • 88