I am trying to build a searchable dashboard for end users with full-text search over a CSV dataset of research topics, using Elasticsearch with Python.

A search should return the row indices of the relevant CSV rows. The columns are _id and topic.

If I query the dataset for "cyber security", most of the results contain the words "cyber security" or "cyber-security", but other rows dealing with food security and army security are returned as well. How can I avoid this for a general search term?

Moreover, the search terms "cyber" or "cyber security" do not pick up some topics containing words like "cybersecurity" or "cybernetics".

How would I go about writing a condition that captures these? Keep in mind that this needs to work the other way too, i.e. if I search for "food security", the cyber topics shouldn't come up.

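from elasticsearch import Elasticsearch
from elasticsearch_dsl import Q, Search

def test_search():
    client = Elasticsearch()
    # Match documents whose "topic" field contains any of the query words
    q = Q("multi_match", query='cyber security',
          fields=['topic'],
          operator='or')
    s = Search(using=client, index="csvfile").query(q)
    # Optional refinements tried so far:
    # .filter('term', name="food")
    # .exclude("match", description="beta")
    return s.execute()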

EDIT: Adding a sample requirement as requested in comments

The CSV file could look like this:

_id,topic
1,food security development in dairy
2,securing hungry people by providing food
3,cyber security in army
4,bio informatics for security
5,cyber security in the world
6,food security in the world
7,cyberSecurity in world
8,army security in asia
9,cybernetics in the world
10,cyber security in the food industry.
11,cyber-information
12,cyber security 
13,secure secure army man
14,crytography for security
15,random stuff

Acceptable

Search term is cyber -> 3,5,7,9,10,11,12
Search term is security -> everything except 11,14,15
Search term is cyber security or cybersecurity -> 3,5,7,9,10,11,12 (in this case cyber needs to have a higher priority; the user won't be interested in other security types)
Search term is food security -> 1,2

Perfect Case
Search term is cyber or cyber security or cybersecurity -> 3,4,5,7,9,10,11,12,14

Considering that cryptography and bio informatics are pretty much cyber security related, should I be using document clustering (ML techniques) to achieve this?

shinz4u

1 Answer

This is normal full-text search behavior. In Elasticsearch, text fields are analyzed. The standard analyzer simply tokenizes the string and converts all tokens to lower case before adding them to the inverted index. When you index "food security", "cyber security", "cyber-security", "army security", "cybersecurity" and "cybernetics", the inverted index looks like this:

"food" -> ["food security"]
"cyber" -> ["cyber security", "cyber-security"]
"army" -> ["army security"]
"security" -> ["food security", "cyber security", "cyber-security", "army security"]
"cybersecurity" -> ["cybersecurity"]
"cybernetics" -> ["cybernetics"]

Then, when you search for "food security", the search string is analyzed to ["food", "security"]. All entries in the inverted index for "food" and "security" will match, namely: ["food security", "cyber security", "cyber-security", "army security"]. On the other hand, a search for "cybersecurity" will only match "cybersecurity".
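
To make the lookup above concrete, here is a minimal pure-Python sketch of an inverted index. The `analyze`, `build_index` and `search` helpers are hypothetical names for illustration only, and the regex tokenizer is only a rough stand-in for the standard analyzer, not its actual implementation.

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for token in analyze(doc):
            index[token].add(doc)
    return index

def search(index, query):
    """OR semantics: a document matches if it contains any query token."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits

docs = ["food security", "cyber security", "cyber-security",
        "army security", "cybersecurity", "cybernetics"]
index = build_index(docs)

food_hits = search(index, "food security")   # pulled in via the shared "security" token
cyber_hits = search(index, "cybersecurity")  # only the exact token matches
```

Because "security" is a token of four of the six documents, any query containing it drags all four along, which is exactly the behavior described in the question.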


EDIT: approaching solution

There are several distinct "features" in your requirements:

  • security must match secure and securing. This can be achieved with the english analyzer, which groups together all inflected forms of a word.
  • cybersecurity must match cyber, cybernetics, etc. This can be achieved with an ngram analyzer.
  • when searching for cyber security, do not match food security. This can be achieved with a common terms query by setting a proper cutoff_frequency.
  • match words that are semantically close (e.g. "cybersecurity" and "cryptography"). This cannot be achieved with Elasticsearch as far as I know.
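
The ngram point can be illustrated with a small pure-Python sketch. The `fourgrams` helper is a hypothetical name, and the regex tokenizer only approximates what a standard tokenizer followed by a 4-gram token filter would emit.

```python
import re

def fourgrams(text, n=4):
    """Return the set of character n-grams of each word (approximation of an ngram filter)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Words shorter than n characters yield no grams at all.
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

query = fourgrams("cyber")

shared_with_cybersecurity = query & fourgrams("cybersecurity")
shared_with_cybernetics = query & fourgrams("cybernetics")
shared_with_food = query & fourgrams("food security")
```

A query for "cyber" shares grams with "cybersecurity" and "cybernetics" but none with "food security". A query for "cyber security", however, also shares the grams of "security" with "food security", which is why the ngram match alone is not enough and has to be combined with a term-frequency-aware clause.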

Grouping everything together, we can come up with the following mapping (see this post for explanations about custom mapping):

{
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "keyword",
          "ignore_above": 256
        },
        "topic": {
          "type": "text",
          "analyzer": "english",
          "fields": {
            "fourgrams": {
              "type": "text",
              "analyzer": "fourgrams"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "fourgrams_filter": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 4
        }
      },
      "analyzer": {
        "fourgrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "fourgrams_filter"
          ]
        }
      }
    }
  }
}

and the following search query:

GET topics/_search 
{
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "common": {
            "topic": {
              "query": "cyber security",
              "cutoff_frequency": 0.3,
              "boost": 2
            }
          }
        },
        {
          "match": {
            "topic.fourgrams": "cyber security"
          }
        }
      ]
    }
  }
}

You will still get false positives, but hopefully the results will be sorted in the expected order, so that you can filter out the lower-scoring ones.
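
For use from Python, one possible approach is to build the same request body as a plain dictionary. The sketch below uses only the standard library; `build_query` is a hypothetical helper, and the default values simply mirror the query above. It assumes an Elasticsearch version that still accepts the `common` query, which was deprecated in later releases.

```python
import json

def build_query(term, cutoff_frequency=0.3, boost=2, size=20):
    """Build the bool/should request body shown above for an arbitrary search term."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    # Boosted clause on the stemmed "topic" field
                    {"common": {"topic": {"query": term,
                                          "cutoff_frequency": cutoff_frequency,
                                          "boost": boost}}},
                    # Fallback clause on the 4-gram subfield
                    {"match": {"topic.fourgrams": term}},
                ]
            }
        },
    }

body = build_query("cyber security")
print(json.dumps(body, indent=2))
```

The resulting dictionary could then be handed to the client, for example via elasticsearch-dsl's `Search.from_dict(body)`. If the query is built with `Q` instead, a dotted field name such as `topic.fourgrams` is not a valid Python keyword argument, so it is usually passed through dictionary unpacking: `Q("match", **{"topic.fourgrams": "cyber security"})`.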

Benoit Guigal
  • Yes, I did find out that this is the normal behaviour. But how would I go about changing it? The built in analyzer will output the current behaviour. What would be a process that I need to follow in order to customize this for a general case? Are there any other built in analyzers which would do this? – shinz4u Aug 13 '18 at 05:37
  • I think part of the behavior you are looking for can be achieved with the keyword datatype https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html and the wildcard query https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html. Could you edit your question to specify which documents you are indexing and which matches you expect for specific search terms, so that I can try to formulate a proper answer? Best – Benoit Guigal Aug 13 '18 at 06:32
  • I assume this would be a custom search, is that right? This is almost bordering on document classification. I think mostly false positives would arise, which is not ideal when trying to make plots or searchable dashboards. – shinz4u Aug 13 '18 at 10:59
  • I have tried this out. It seems to work better in testing, but not from Python, because of "topic.fourgrams". Could you tell me why you are doing the search "match": { "topic.fourgrams": "cyber security" }? Can the topic.fourgrams JSON search criterion be expanded? I am not able to pass it as is in Python elasticsearch-dsl. – shinz4u Aug 28 '18 at 11:17