My code is shown below. Even though I have listed ["is","it","possible"] as the stop words of a stop filter, I still get matches for them in the search output. Could someone explain why Elasticsearch is not removing them from the input documents while indexing?

issue_with_stop_word.csv is as follows:

id,qid1,qid2,question1
5,11,12,How do I recover my Facebook login password?
7,15,16,Is it possible to sleep without dreaming?
11,23,24,How easy is it to hack the login password of a Macbook Air?
12,25,26,How easy is it to hack the login password of a Macbook Air?
13,27,28,Is it possible to know who visited my Facebook profile?
15,31,32,Is it possible to know who visited my Facebook profile?
16,33,34,Is it possible to know who visited my Facebook profile?
18,37,38,Is it possible to hack someone's Facebook messages?
20,41,42,Is it possible to know who visited my Facebook profile?
29,59,60,How do I recover my Facebook password without having to reset it?
31,63,64,What are some special cares for someone with a nose that gets stuffy during the night?
32,65,66,What Game of Thrones villain would be the most likely to give you mercy?

The code is below:

from elasticsearch import Elasticsearch
from elasticsearch import helpers
query='Is it possible ?'

index_name = 'sample'
doc_type = 'dummy'
content = 'content'
document = 'question'
identity = 'id'



def main():
    es = Elasticsearch('localhost:9200')
    create_indices(es, index_name)
    res = es.search(index=index_name, doc_type=doc_type,
                    body={
                        "query": {
                            "match": {
                               'content': "is it possible"
                            }
                        }
                    })
    print("%d documents found:" % len(res['hits']['hits']))
    for doc in res['hits']['hits']:
        print("%s) %s %s" % (doc['_id'], doc['_source']['content'], str(doc['_score'])))


def create_indices(es, index_name):
    bulk_data = []
    with open('issue_with_stop_word.csv', 'rb') as tsvin:
        tsvin.next()
        for row in tsvin:
            row = unicode(row, errors='replace')
            doc = str(row.split(',')[3]).strip()
            int_id = int(row.split(',')[1])
            value = dict()
            value[content] = doc
            value[identity] = int_id
            bulk_data.append(value)

    if es.indices.exists(index_name):
        print("deleting '%s' index..." % (index_name))
        res = es.indices.delete(index=index_name)
        print(" response: '%s'" % (res))
    # since we are running locally, use one shard and no replicas
    request_body = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                        "filter": {
                            "my_stop": {
                                "type": "stop",
                                "stopwords": ["is","it","possible"]
                            }
                        }
            }
        }
    }

    print("creating '%s' index..." % (index_name))
    res = es.indices.create(index=index_name, body=request_body)
    print(" response: '%s'" % (res))

    # bulk index the data
    print("bulk indexing...")

    actions = [ 
        {
            "_index": index_name,
            "_type" : doc_type,
            "_id": val[identity],
            content:val[content]
        }
        for val in bulk_data
    ]
    res = helpers.bulk(es, actions, refresh = True)

if __name__ == '__main__':
    main()
thara
khangaroth

1 Answer

I may be misinterpreting your question here, but I think you may be misunderstanding the purpose of filters a little bit.

Analyzers, of which filters are a part, do not change the body of the document that you send to Elasticsearch and that is stored for later retrieval. Instead, Elasticsearch builds an inverted index in which it stores the individual words (or tokens) produced from your documents. This index is what you search against. The original text of your document is stored unchanged in the _source field so that it can be returned with the search hits.

The following image from a presentation I gave a while back may help with this concept:

[Image not reproduced here]

In your case, retrieving a document returns your unchanged input text. However, once the stop words are applied to the field, a search for "is" or "it" would return no results.
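To make the distinction between the inverted index and _source concrete, here is a minimal pure-Python sketch. It is a toy model, not how Elasticsearch is implemented: the function names (analyze, index_documents, search) are hypothetical, and the regex tokenizer only approximates the standard tokenizer.

```python
import re

STOPWORDS = {"is", "it", "possible"}  # the stop words from the question


def analyze(text, stopwords=STOPWORDS):
    """Toy analyzer: split into words, lowercase, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


def index_documents(docs):
    """Return (inverted_index, source_store) for a dict of id -> text."""
    inverted_index = {}
    source_store = {}
    for doc_id, text in docs.items():
        source_store[doc_id] = text  # kept verbatim, like _source
        for token in analyze(text):
            inverted_index.setdefault(token, set()).add(doc_id)
    return inverted_index, source_store


def search(inverted_index, query):
    """Analyze the query the same way, then look up each remaining token."""
    hits = set()
    for token in analyze(query):
        hits |= inverted_index.get(token, set())
    return hits


docs = {
    7: "Is it possible to sleep without dreaming?",
    5: "How do I recover my Facebook login password?",
}
inverted_index, source_store = index_documents(docs)

print(search(inverted_index, "is it possible"))  # no tokens survive analysis
print(search(inverted_index, "sleep"))           # matched via the index
print(source_store[7])                           # original text, untouched
```

The stop words never reach the toy index, so they cannot be matched, yet the stored text still contains them.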


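The issue in your case is that you never assign the filter you created to the field that contains your text (content). A token filter only takes effect when it is part of an analyzer and that analyzer is set on the field in the mapping. Without this, Elasticsearch analyzes the field with the default standard analyzer, which does not apply your stop words.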

When I create the index as follows it displays the expected behavior for me:

PUT 127.0.0.1:9200/stacktest
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "standard",
                    "stopwords": [
                        "is",
                        "it",
                        "possible"
                    ]
                }
            }
        }
    },
    "mappings": {
        "question": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
}
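Applied to the Python code in the question, the same idea might look like the sketch below. It keeps the question's my_stop filter and wires it into a custom analyzer that is then assigned to the content field. The analyzer name my_analyzer is borrowed from the request above. The choice of the standard tokenizer plus the lowercase filter is an assumption, and the typed mapping (keyed by doc_type) assumes an Elasticsearch version that still supports mapping types, as the question's code does.

```python
index_name = "sample"
doc_type = "dummy"   # mapping type, as in the question (older Elasticsearch)
content = "content"

request_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "my_stop": {
                    "type": "stop",
                    "stopwords": ["is", "it", "possible"],
                }
            },
            "analyzer": {
                # Assumed analyzer definition: it is what makes the
                # filter usable by a field.
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase first, so "Is" matches the stop word "is"
                    "filter": ["lowercase", "my_stop"],
                }
            },
        },
    },
    "mappings": {
        doc_type: {
            "properties": {
                # Assign the analyzer to the field that holds the text.
                content: {"type": "text", "analyzer": "my_analyzer"}
            }
        }
    },
}

# In the question's create_indices() this body would be passed unchanged:
# es.indices.create(index=index_name, body=request_body)
```

Because the match query analyzes the search text with the field's analyzer by default, the stop words would be dropped from both the indexed tokens and the query.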

POST 127.0.0.1:9200/_bulk
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":5, "qid1":11, "qid2":12, "content": "How do I recover my Facebook login password?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":7, "qid1":15, "qid2":16,"content": "Is it possible to sleep without dreaming?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":11, "qid1":23, "qid2":24, "content": "How easy is it to hack the login password of a Macbook Air?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":12, "qid1":25, "qid2":26, "content": "How easy is it to hack the login password of a Macbook Air?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":13, "qid1":27, "qid2":28, "content": "Is it possible to know who visited my Facebook profile?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":15, "qid1":31, "qid2":32, "content": "Is it possible to know who visited my Facebook profile?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":16, "qid1":33, "qid2":34, "content": "Is it possible to know who visited my Facebook profile?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":18, "qid1":37, "qid2":38, "content": "Is it possible to hack someone's Facebook messages?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":20, "qid1":41, "qid2":42, "content": "Is it possible to know who visited my Facebook profile?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":29, "qid1":59, "qid2":60, "content": "How do I recover my Facebook password without having to reset it?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":31, "qid1":63, "qid2":64, "content": "What are some special cares for someone with a nose that gets stuffy during the night?"}
{ "index" : { "_index" : "stacktest", "_type" : "question" } }
{"id":32, "qid1":65, "qid2":66, "content": "What Game of Thrones villain would be the most likely to give you mercy?"}

Query for a stop word:

GET 127.0.0.1:9200/stacktest/_search
{
    "query": {
        "match": {
            "content": "is"
        }
    }
}

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

Query for another word:

GET 127.0.0.1:9200/stacktest/_search
{
    "query": {
        "match": {
            "content": "how"
        }
    }
}

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 0.49191087,
        "hits": [
            {
                "_index": "stacktest",
                "_type": "question",
                "_id": "AVpGFiGBW2Sd8hDDcFxg",
                "_score": 0.49191087,
                "_source": {
                    "id": 12,
                    "qid1": 25,
                    "qid2": 26,
                    "content": "How easy is it to hack the login password of a Macbook Air?"
                }
            },
            {
                "_index": "stacktest",
                "_type": "question",
                "_id": "AVpGFiGBW2Sd8hDDcFxd",
                "_score": 0.4375115,
                "_source": {
                    "id": 5,
                    "qid1": 11,
                    "qid2": 12,
                    "content": "How do I recover my Facebook login password?"
                }
            },
            {
                "_index": "stacktest",
                "_type": "question",
                "_id": "AVpGFiGBW2Sd8hDDcFxm",
                "_score": 0.3491456,
                "_source": {
                    "id": 29,
                    "qid1": 59,
                    "qid2": 60,
                    "content": "How do I recover my Facebook password without having to reset it?"
                }
            },
            {
                "_index": "stacktest",
                "_type": "question",
                "_id": "AVpGFiGBW2Sd8hDDcFxf",
                "_score": 0.24257512,
                "_source": {
                    "id": 11,
                    "qid1": 23,
                    "qid2": 24,
                    "content": "How easy is it to hack the login password of a Macbook Air?"
                }
            }
        ]
    }
}
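Another way to check whether the stop words are applied, without indexing and querying documents, is to ask the _analyze API which tokens an analyzer produces. The helper below is a hedged sketch: analyzed_tokens is a hypothetical name, and the body= calling style assumes an elasticsearch-py client of the same generation as the one used in the question.

```python
def analyzed_tokens(es, index, analyzer, text):
    """Return the tokens that `analyzer` on `index` produces for `text`.

    `es` is expected to be an elasticsearch.Elasticsearch client.
    """
    response = es.indices.analyze(
        index=index,
        body={"analyzer": analyzer, "text": text},
    )
    # The _analyze response lists one entry per emitted token.
    return [entry["token"] for entry in response["tokens"]]


# Example usage against a running cluster (not executed here):
# from elasticsearch import Elasticsearch
# es = Elasticsearch('localhost:9200')
# print(analyzed_tokens(es, "stacktest", "my_analyzer",
#                       "Is it possible to sleep without dreaming?"))
```

If the analyzer is set up as intended, the returned list should contain none of "is", "it" or "possible".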

I hope that answers your original question.

Sönke Liebau
  • Yes Sir, I understood your explanation, but in my code I have explicitly set "stopwords": ["is","it","possible"], meaning that if I search for 'is', 'it' or 'possible' I should not get any results. Unfortunately, I am still getting search results, as follows: 9 documents found: 15) Is it possible to sleep without dreaming? 0.8613818 .... – khangaroth Feb 16 '17 at 01:49
  • Apologies! I had indeed not fully read your code and missed the part where you queried - I thought you just retrieved all documents and wondered why the stopwords were still there. I will edit my response to answer your original question. – Sönke Liebau Feb 16 '17 at 08:47
  • @khangaroth if this was helpful to you and solved your problem, would you mind accepting Sönke's answer? Thank you! – Lars Francke Feb 20 '17 at 20:31