In elasticsearch-py, how different should the search command be if I used custom tokenization during indexing?

Question

I am using elasticsearch-py to index tweets (originally in JSON format). In order to preserve special characters like hashtags, user targets and emoticons, I specified a special mapping while creating the index. This is what it looks like:

import json
import sys

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(
    index='ecommercetweets',
    body={
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": 1
            },
            "analysis": {
                "filter": {
                    "tweet_filter": {
                        "type": "word_delimiter",
                        "type_table": ["# => ALPHA", "@ => ALPHA", ":) => ALPHA", ":( => ALPHA"]
                    }
                },
                "analyzer": {
                    "tweet_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "tweet_filter"]
                    }
                }
            }
        },
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                        "analyzer": "tweet_analyzer"
                    }
                }
            }
        }
    },
    ignore=400
)

# Read one JSON-encoded tweet per line from the file given on the command line.
with open(sys.argv[1], "r") as fin:
    for count, line in enumerate(fin):
        jsonLine = json.loads(line)
        doc = {
            'tweetId': jsonLine["id"],
            'text': jsonLine["text"],
            'userId': jsonLine["user"]["id"],
            'favorite_count': jsonLine["favorite_count"],
            'retweet_count': jsonLine["retweet_count"],
            'language': jsonLine["lang"],
            'dateTime': jsonLine["created_at"],
            'location': jsonLine["place"]
        }

        # Use the running line count as the document id.
        es.index(index='ecommercetweets', doc_type='tweet', id=count, body=doc)

I am searching with these commands:

results1 = es.search(index='ecommercetweets', q="text:delivery")
results2 = es.search(index='ecommercetweets', q="text:#delivery")

Both return the same number of hits, although I am fairly sure that should not be the case for the data I am using.

Is something wrong with my search commands?

If I'm not mistaken, queries are analyzed as well. I haven't used Elasticsearch with Python, so I can't provide an example, but you likely need to specify the same analyzer for the search. – J0HN, Apr 28 '15 at 12:44

I tried results3 = es.search(index='ecommercetweets',q="text:delivery",analyzer="tweet_analyzer"). Sadly, this still returns the same number of hits. – Satarupa Guha, Apr 29 '15 at 13:35

score 1 · Answer 1 · answered Apr 28 '15 at 12:47

One way to deal with it is to use a term query (or term filter). A term query is not analyzed, so the value is matched against the indexed tokens exactly as given. This should do it:

es.search(index='ecommercetweets',body={
   "query": {
      "term": {
         "text": {
            "value": "#delivery"
         }
      }
   }
})

Here is some code I used to play around with it:

http://sense.qbox.io/gist/fe61f0cd92b465276b261100cbe7f4778002a96d
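
For reference, the two styles of query body can be sketched in Python as below. This is a minimal sketch, not the code from the gist: the helper names `term_body` and `match_body` are hypothetical, and lowercasing the term assumes the `lowercase` filter from the question's `tweet_analyzer` ran at index time. A term query skips analysis, while a match query can name the analyzer to apply to the query text.

```python
def term_body(field, value):
    """Build a term query body; Elasticsearch does not analyze the value.

    Lowercasing here is an assumption that mirrors the index-time
    `lowercase` filter, since a term query will not do it for us.
    """
    return {"query": {"term": {field: {"value": value.lower()}}}}


def match_body(field, text, analyzer=None):
    """Build a match query body, optionally naming the search-time analyzer."""
    clause = {"query": text}
    if analyzer is not None:
        clause["analyzer"] = analyzer
    return {"query": {"match": {field: clause}}}


if __name__ == "__main__":
    print(term_body("text", "#Delivery"))
    print(match_body("text", "#delivery", analyzer="tweet_analyzer"))
```

Either body can then be passed to the client as es.search(index='ecommercetweets', body=...).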

answered Apr 28 '15 at 12:47

Sloan Ahrens

Thanks a lot. Following your code I was able to replicate the situation in Sense. But not the equivalent in Python. Strangely in Python, it shows 0 results for #delivery. Is it possible to do the whole thing in Sense, as in, read the JSON from a file and then populate the index in a loop, etc.? – Satarupa Guha Apr 29 '15 at 05:52
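
A zero-hit term query for #delivery from Python would be consistent with the custom analyzer never having been applied to the index the script wrote to. One thing worth ruling out: `ignore=400` on `es.indices.create` tells the client to swallow HTTP 400 responses, so if the index already existed from an earlier run, or the settings and mapping body was rejected, the call would fail silently. Below is a minimal sketch for checking what the live mapping actually records. `field_analyzer` is a hypothetical helper, and the response layout it walks (index name, then `mappings`, then document type) is an assumption that matches typed mappings like the one in the question.

```python
def field_analyzer(es, index, doc_type, field):
    """Return the analyzer recorded for `field` in the live mapping, or None.

    Assumes the get_mapping response is shaped as
    {index: {"mappings": {doc_type: {"properties": {...}}}}}.
    """
    mapping = es.indices.get_mapping(index=index)
    properties = (
        mapping.get(index, {})
        .get("mappings", {})
        .get(doc_type, {})
        .get("properties", {})
    )
    return properties.get(field, {}).get("analyzer")
```

If field_analyzer(es, 'ecommercetweets', 'tweet', 'text') returns None rather than 'tweet_analyzer', it may help to delete the index and recreate it without ignore=400 so that any error from Elasticsearch surfaces.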
