I am using elasticsearch-py to index tweets (originally in JSON format). In order to preserve special characters like hashtags, user targets and emoticons, I specified a special mapping while creating the index. This is what it looks like:

from elasticsearch import Elasticsearch
import sys,json
es = Elasticsearch()

es.indices.create(
    index='ecommercetweets',
    body={
          "settings" : {
                "index" : {
                    "number_of_shards" : 1,
                    "number_of_replicas" : 1
                },  
                "analysis" : {
                    "filter" : {
                        "tweet_filter" : {
                            "type" : "word_delimiter",
                            "type_table": ["# => ALPHA", "@ => ALPHA", ":) => ALPHA", ":( => ALPHA"]
                        }   
                    },
                    "analyzer" : {
                        "tweet_analyzer" : {
                            "type" : "custom",
                            "tokenizer" : "whitespace",
                            "filter" : ["lowercase", "tweet_filter"]
                        }
                    }
                }
            },
            "mappings" : {
                "tweet" : {
                    "properties" : {
                        "text" : {
                            "analyzer" : "tweet_analyzer"
                        }
                    }
                }
            }
      },
      ignore=400
)

# Index one tweet per line from the file given as the first argument.
with open(sys.argv[1], "r") as fin:
    for count, line in enumerate(fin):
        jsonLine = json.loads(line)
        doc = {
            'tweetId': jsonLine["id"],
            'text': jsonLine["text"],
            'userId': jsonLine["user"]["id"],
            'favorite_count': jsonLine["favorite_count"],
            'retweet_count': jsonLine["retweet_count"],
            'language': jsonLine["lang"],
            'dateTime': jsonLine["created_at"],
            'location': jsonLine["place"]
        }

        # The line number doubles as the document id.
        es.index(index='ecommercetweets', doc_type='tweet', id=count, body=doc)

I am searching with these commands:

results1 = es.search(index='ecommercetweets',q="text:delivery")
results2 = es.search(index='ecommercetweets',q="text:#delivery")

Both return the same number of hits, although I am fairly sure that should not be the case for the data I am using.

Is something wrong with my search command?

Satarupa Guha
  • If I'm not mistaken, queries are analyzed as well. I haven't used Elasticsearch with Python, so I can't provide an example, but you likely need to specify the same analyzer for the search. – J0HN Apr 28 '15 at 12:44
  • results3 = es.search(index='ecommercetweets',q="text:delivery",analyzer="tweet_analyzer"). This sadly still returns the same number of hits. – Satarupa Guha Apr 29 '15 at 13:35

1 Answer

One way you can deal with it is to use a term query (or term filter). This should do it:

es.search(index='ecommercetweets',body={
   "query": {
      "term": {
         "text": {
            "value": "#delivery"
         }
      }
   }
})
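
The reason this can help is that a term query is not analyzed: the string "#delivery" is compared with the indexed tokens exactly as written, whereas a search through the q= parameter runs the query text through an analyzer first. As a rough sketch, the two kinds of request body can be built side by side. The helper names term_query, match_query and count_hits are hypothetical, and passing analyzer="tweet_analyzer" to the match query assumes the analyzer from the question was actually registered on the index.

```python
def term_query(field, value):
    """Body for a term query: the value is matched as-is, with no analysis."""
    return {"query": {"term": {field: {"value": value}}}}


def match_query(field, text, analyzer=None):
    """Body for a match query: the text is analyzed before matching."""
    clause = {"query": text}
    if analyzer is not None:
        # Optional: name the analyzer to apply to the query text.
        clause["analyzer"] = analyzer
    return {"query": {"match": {field: clause}}}


def count_hits(es, index, body):
    """Run a search with an existing client and return hits.total as reported."""
    return es.search(index=index, body=body)["hits"]["total"]
```

With a client such as the es object from the question, count_hits(es, 'ecommercetweets', term_query('text', '#delivery')) and the same call with term_query('text', 'delivery') should give different totals if the hash sign survived indexing.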

Here is some code I used to play around with it:

http://sense.qbox.io/gist/fe61f0cd92b465276b261100cbe7f4778002a96d
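
If the Python version behaves differently from the Sense version, it may be worth checking that the index really carries the custom mapping. The ignore=400 in the question's es.indices.create call suppresses HTTP 400 responses, so a rejected create request (for instance because the index already exists) passes silently and the index keeps whatever mapping it already had. A minimal sketch of such a check follows. The helper names field_analyzer and check_tweet_mapping are hypothetical, and the response layout assumed here (index, then "mappings", then document type, then "properties") is the one returned by Elasticsearch versions that still use document types.

```python
def field_analyzer(mapping_response, index, doc_type, field):
    """Return the analyzer configured for `field`, or None if none is set."""
    properties = (
        mapping_response.get(index, {})
        .get("mappings", {})
        .get(doc_type, {})
        .get("properties", {})
    )
    return properties.get(field, {}).get("analyzer")


def check_tweet_mapping(es, index="ecommercetweets", doc_type="tweet"):
    """Fetch the live mapping and report which analyzer the text field uses."""
    response = es.indices.get_mapping(index=index)
    return field_analyzer(response, index, doc_type, "text")
```

If check_tweet_mapping(es) returns None rather than "tweet_analyzer", the text field is using default analysis, and deleting the index and creating it again without ignore=400 should surface the underlying error.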

Sloan Ahrens
  • Thanks a lot. Following your code I was able to replicate the situation in Sense. But not the equivalent in Python. Strangely in Python, it shows 0 results for #delivery. Is it possible to do the whole thing in Sense, as in, read the JSON from a file and then populate the index in a loop, etc.? – Satarupa Guha Apr 29 '15 at 05:52