Lucene full-text index: all indexed nodes with same score?

Question

I have been trying to solve this issue for days.

I want to run a START query against a full-text index, ordered by relevance, so that I can paginate the results.

Happily, I finally found this thread on full-text indexing in Neo4j (using Python as the driver):

[https://groups.google.com/forum/#!topic/neo4j/9G8fcjVuuLw]

I had imported my db with the batch super-importer and got a reply from @Michaelhunger, who kindly pointed out a bug: all scores would have been imported with the same value.

So now I am recreating the index and checking the score via REST (&order=score):

http://localhost:7474/db/data/index/node/myIndex?query=name:myKeyWord&order=score

and I noticed that the entries still all have the same score.

(You have to do an AJAX query to see it, because the web console does not show all the data!)
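
As a rough sketch of that request outside the browser, the URL can be assembled in Python with the standard library. The helper name `index_query_url` is hypothetical; the path and the `query` and `order` parameters are taken from the URL above, and nothing is assumed here about the shape of the response.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def index_query_url(base_url, index_name, lucene_query, order="score"):
    """Build the index query URL (hypothetical helper, not part of any driver)."""
    # urlencode escapes characters such as ':' so the Lucene query survives transport.
    params = urlencode({"query": lucene_query, "order": order})
    return "{}/index/node/{}?{}".format(base_url.rstrip("/"), index_name, params)


url = index_query_url("http://localhost:7474/db/data/", "myIndex", "name:DNA")
print(url)

# Round-trip check: decoding the query string gives back the original parameters.
print(parse_qs(urlsplit(url).query))
```

The resulting URL can then be fetched with any HTTP client and the returned entries inspected for their scores.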

My code to recreate a full-text Lucene index on each node's 'name' property (here using neo4j-rest-client, but I will also try py2neo, as in the Google discussion):

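from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data/")

# Create a full-text Lucene index.
myIndex = gdb.nodes.indexes.create("myIndex", type="fulltext", provider="lucene")

# `node` stands for each node being indexed; it is added under its 'name' property.
myIndex.add("name", node.get("name"), node)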

results:

http://localhost:7474/db/data/index/node/myIndex?query=name:DNA&order=score

data Object {id: 17062920, name: "DNA damage theory of aging"}
VM995:10 **score 11.097855567932129**
...
data Object {id: 17022698, name: "DNA (film)"}
VM995:10 **score 11.097855567932129**

The documentation [http://neo4j.com/docs/stable/indexing-lucene-extras.html#indexing-lucene-sort] says that Lucene does the sorting itself very well, so I understood that it creates a ranking by itself at import time; it does not.

What am I doing wrong or missing?

score 1 · Answer 1 · answered Aug 23 '15 at 17:26


I believe the issue you are seeing is related to a combination of the text you are indexing, the query term(s) and, as Michael Hunger pointed out, the current Lucene configuration in Neo4j, which has OMITNORMS=true. With this setting, a Lucene query like the ones in your examples, where the indexed texts differ in length but the query term appears once in each document, often yields the same relevance score for every document. The reason is that the length of the indexed text (field length normalization) is NOT taken into account when OMITNORMS is true.
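
To illustrate the effect, here is a minimal Python sketch of a simplified form of Lucene's classic TF-IDF scoring for a single-term query. It assumes tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)) and a length norm of 1 / sqrt(number of terms), and it ignores query normalization, coordination factors, boosts and the lossy encoding of norms, so the absolute numbers will not match what Neo4j returns. The function name `classic_score` and the two-document corpus are illustrative only.

```python
import math


def classic_score(doc_tokens, term, num_docs, doc_freq, omit_norms):
    """Simplified single-term TF-IDF score (an approximation, not Lucene itself)."""
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # With norms omitted, document length no longer influences the score.
    length_norm = 1.0 if omit_norms else 1.0 / math.sqrt(len(doc_tokens))
    return tf * idf * idf * length_norm


# Two names from the question, lowercased and split on whitespace.
docs = {
    "DNA damage theory of aging": "dna damage theory of aging".split(),
    "DNA (film)": "dna film".split(),
}
doc_freq = sum(1 for tokens in docs.values() if "dna" in tokens)

scores_without_norms = {
    name: classic_score(tokens, "dna", len(docs), doc_freq, omit_norms=True)
    for name, tokens in docs.items()
}
scores_with_norms = {
    name: classic_score(tokens, "dna", len(docs), doc_freq, omit_norms=False)
    for name, tokens in docs.items()
}

print(scores_without_norms)  # both documents receive the same score
print(scores_with_norms)     # the shorter name scores higher
```

In this sketch the only factor that could separate the two documents is the length norm, so switching it off makes them tie.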

Looking at your examples it is not clear what your expected results are. For example, are you expecting documents with shorter text to appear first?

In my own experience using lucene and Neo4j I have seen many instances where the relevancy scores being returned are different across different queries.

answered Aug 23 '15 at 17:26

mfkilgore


Hi @mfkilgore, where is it possible to set `OMITNORMS=false`? The results I am expecting are described here: [http://stackoverflow.com/q/31862761/305883]. I need to paginate the results, and the first n items must match the keywords in the user query, most likely the shortest strings in the simplest case; this could be obtained with Levenshtein distance or other rankings that index word stems. For example, using `nltk` in Python (I see you work with py2neo), one could even extract the word stems and add them to the legacy index, yet the final step of having scores set properly remains important. – user305883 Aug 24 '15 at 12:59
There is no way in the current release to set OMITNORMS. Looking at your example, it is not clear what the problem is: the text shown does contain the phrase United States, which the query correctly finds. Scoring in Lucene is an interesting topic; you can find more here: http://lucene.apache.org/core/3_6_2/scoring.html. Note that it is possible for more than one document to have the same score. – mfkilgore Aug 24 '15 at 15:19
Results are not meaningful if they merely match a keyword in a string, in no particular order. In the example I provided, 'United States' would be buried thousands of rows down, and a first result such as 'List of United States National Historic Landmarks in United States commonwealths and territories, associated states, and foreign states' is "random". Actually, after much research in other threads, I found that results are sorted by the Neo4j ID. If relevance ordering is not possible, I conclude that the Neo4j documentation is misleading in this respect, since it suggests examples of relevance applied to the Lucene legacy index. – user305883 Aug 24 '15 at 15:40
I am sorry you are seeing issues. In many of my queries, I do see different scores, and the results are returned in relevance order. Depending on the query, I also see results with the same score for the top documents, which might be what you are seeing; this can happen, and when it does there is no guaranteed order. – mfkilgore Aug 24 '15 at 18:11
I see the same score for **all queries** (on a db of 4M entities, regardless of the number of rows matching the query). If there is a relevance order, then there must be a criterion for how it is set, and the examples I posted indicate that the only criterion is the Neo4j ID. I will be happy to ..ahem.. "sort it out" with you. Would you be available for a chat? It could be very helpful. – user305883 Aug 24 '15 at 21:03
I would be happy to work with you on this, as a first step would it be possible to share your database? Please contact me directly. – mfkilgore Aug 25 '15 at 14:31

score 0 · Accepted Answer · answered Oct 05 '15 at 14:58

The goal of my question is to obtain a list of results ordered by relevance of nodes' names matching the queried keywords.

@mfkilgore pointed out this workaround:

START n=node:topic('name:(keyword1* AND keyword2*)')
MATCH (n)
WITH n
ORDER BY length(split(n.name, " ")) ASC
LIMIT 20
RETURN n

This workaround splits each node's name on spaces, counts the resulting words, and orders the results by that word count, shortest first.
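
The same ordering can be sketched client-side in Python, which may help when checking pages of results. This is only an illustration: the function name `rank_by_word_count` and the `page`/`page_size` parameters are hypothetical, and `name.split(" ")` is meant to mirror `split(n.name, " ")` in the query above.

```python
def rank_by_word_count(names, page=0, page_size=20):
    """Order names by their number of space-separated words and return one page."""
    # sorted() is stable, so names with equal word counts keep their input order.
    ordered = sorted(names, key=lambda name: len(name.split(" ")))
    start = page * page_size
    return ordered[start:start + page_size]


names = [
    "List of United States National Historic Landmarks in United States "
    "commonwealths and territories, associated states, and foreign states",
    "United States Navy",
    "United States",
]

first_page = rank_by_word_count(names, page=0, page_size=2)
second_page = rank_by_word_count(names, page=1, page_size=2)
print(first_page)
print(second_page)
```

Note that this is a proxy for relevance rather than a Lucene score: it simply favours names with fewer words among those that already match the keywords.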
