I have a big feed of news articles that I'm indexing. I'd like to avoid indexing a lot of articles that are nearly the same (for example, articles from a news service might appear many times with slightly different date formats).
So I thought I'd run a more-like-this query for each article before indexing it. If I get back a hit with a score above some cutoff, I assume the article is already indexed and skip it.
But when I run my more-like-this query, all the hits I get come back with a score of zero. I can't tell if that's expected, if I'm doing something wrong, or if I've discovered a bug.
My query looks like:
POST _search
{
  "query": {
    "bool": {
      "filter": [
        {
          "more_like_this": {
            "fields": ["text"],
            "like": "Doctor Sentenced In $3.1M Health Care Fraud Scheme Justice Department Documents & Publications \nGreenbelt, Maryland - U.S. District Judge Deborah K. Chasanow sentenced physician [snip]"
          }
        }
      ]
    }
  }
}
And the results I get back are:
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 390,
    "max_score": 0,
    "hits": [
      [snip]
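
One thing I may be overlooking: as far as I understand, clauses under bool.filter run in filter context, where no relevance score is computed, so a bool query that contains only filter clauses returns every hit with a score of 0. Below is a minimal Python sketch of the same duplicate check with the more_like_this clause moved to bool.must (query context). The helper names build_mlt_query and is_probable_duplicate, the size parameter, and the cutoff value are hypothetical and only for illustration; they are not part of any Elasticsearch client, and the sketch only builds the request body and inspects a response that has already been parsed into a dict.

```python
def build_mlt_query(article_text, fields=("text",), size=1):
    """Build a more_like_this search body that runs in query (scoring) context."""
    return {
        # Only the best hit is needed to decide whether a near-duplicate exists.
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to _score; "filter" clauses do not.
                "must": [
                    {
                        "more_like_this": {
                            "fields": list(fields),
                            "like": article_text,
                        }
                    }
                ]
            }
        },
    }


def is_probable_duplicate(response, cutoff):
    """Return True if the best hit in a parsed search response scores above cutoff."""
    max_score = response.get("hits", {}).get("max_score")
    # max_score is null (None) when there are no hits at all.
    return max_score is not None and max_score > cutoff
```

The body returned by build_mlt_query(article_text) would be sent to _search in place of the request above. The scores are not normalized, so I assume the cutoff would have to be tuned against known duplicates in the actual index.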