When I index the field and search for a string that contains '-', like the example above, ArangoDB treats the '-' as a negation operator and therefore does not search for that string. What is the solution for searching documents whose values contain '-'?
- This issue https://github.com/arangodb/arangodb/issues/928 may help answer the question. – Dongqing Apr 20 '16 at 08:37
- Why do you use a full-text index on this kind of data? If you only filter by equality (`==`), use a hash index instead. If you need to perform a prefix search ("starts with ..."), you can utilize a skiplist index with some creativity, see: http://stackoverflow.com/questions/35587746/on-multiple-index-usage-in-arangodb as well as https://docs.arangodb.com/cookbook/PopulatingAnAutocompleteTextbox.html. For suffix searching ("ends with ..."), consider also storing the string [REVERSE()](https://docs.arangodb.com/Aql/StringFunctions.html)d and applying the same technique as for prefix searching. – CodeManX Apr 20 '16 at 22:30
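A minimal arangosh sketch of the approach described in the comment above; the collection name "docs" and the extra idReversed attribute are made up for illustration:

var id = "3da549f0-0e88-4297-b6af-5179b74bd929";
db._create("docs");
db.docs.save({ id: id, idReversed: id.split("").reverse().join("") });
db.docs.ensureIndex({ type: "hash", fields: ["id"] });             // exact equality lookups
db.docs.ensureIndex({ type: "skiplist", fields: ["id"] });         // prefix (range) lookups
db.docs.ensureIndex({ type: "skiplist", fields: ["idReversed"] }); // suffix lookups via the reversed copy

// equality
db._query("FOR d IN docs FILTER d.id == @v RETURN d", { v: id });

// prefix search ("starts with 3da549f0"): a range FILTER the skiplist can serve
db._query("FOR d IN docs FILTER d.id >= '3da549f0' AND d.id < '3da549f1' RETURN d");

// suffix search ("ends with bd929"): prefix-search the reversed copy ('bd929' reversed is '929db')
db._query("FOR d IN docs FILTER d.idReversed >= '929db' AND d.idReversed < '929dc' RETURN d");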
2 Answers
Edit: I just went to look at the source code. From there it looks like '-' should only be a problem if it is the first character in the string, and it isn't the first character in the example you give above, so I'm confused.
There doesn't appear to be any way of escaping '-' characters. Another idea might be to prefix each '-' with a '+'. Have you tried:
collection.fulltext(attribute, "3da549f0+-0e88+-4297+-b6af+-5179b74bd929");
I guessed from reading the docs that using "prefix:" or "complete:" as an escape might work.
collection.fulltext(attribute, "complete:3da549f0-0e88-4297-b6af-5179b74bd929");
But apparently it doesn't.

- I've updated my answer. Is this really the query you are using? Because the source code looks like it only cares about a '-' at the start of a word (after a comma or a space), not in the middle of a query. – Nick Fortescue Apr 20 '16 at 08:51
Trying to reproduce what you did. My answer could probably be more accurate if you provided a better reproducible example (with arangosh only) of what you're currently trying:
http+tcp://127.0.0.1:8529@_system> db._create("testIndex")
http+tcp://127.0.0.1:8529@_system> db.testIndex.ensureIndex({type: "fulltext", fields: ["complete:3da549f0-0e88-4297-b6af-5179b74bd929"]})
{
  "fields" : [
    "complete:3da549f0-0e88-4297-b6af-5179b74bd929"
  ],
  "id" : "testIndex/4687162",
  "minLength" : 2,
  "sparse" : true,
  "type" : "fulltext",
  "unique" : false,
  "isNewlyCreated" : true,
  "code" : 201
}
http+tcp://127.0.0.1:8529@_system> db.testIndex.save({'complete:3da549f0-0e88-4297-b6af-5179b74bd929': "find me"})
{
  "_id" : "testIndex/4687201",
  "_key" : "4687201",
  "_rev" : "4687201"
}
http+tcp://127.0.0.1:8529@_system> db._query('FOR doc IN FULLTEXT(testIndex, "complete:3da549f0-0e88-4297-b6af-5179b74bd929", "find") RETURN doc')
[object ArangoQueryCursor, count: 1, hasMore: false]
[
  {
    "_id" : "testIndex/4687201",
    "_key" : "4687201",
    "_rev" : "4687201",
    "complete:3da549f0-0e88-4297-b6af-5179b74bd929" : "find me"
  }
]
So the use case looks different:
db.test2.save({id: 'complete:3da549f0-0e88-4297-b6af-5179b74bd929'})
db.test2.ensureIndex({type: "fulltext", fields: ["id"]})
db._query('FOR doc IN FULLTEXT(test2, "id", "3da549f0-0e88-4297-b6af-5179b74bd929") RETURN doc')
which will return an empty result.
To understand what's going on, one needs to know how the fulltext index works. It splits the text at word boundaries and stores the resulting words, each with a reference to the document, in an index-global word list; several documents may be referenced by one word in that list.
Once the index is queried, the requested words are looked up in that index-global word list, and each word found yields a bucket of documents containing it. These buckets are combined and returned as the total list of documents to iterate over.
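As a small illustration of that word-bucket lookup (collection and attribute names are made up), a comma-separated fulltext query only returns documents that appear in the buckets of all requested words:

db._create("texts");
db.texts.ensureIndex({ type: "fulltext", fields: ["body"] });
db.texts.save({ body: "find me please" });
db.texts.save({ body: "find something else" });

// "find,me" requires both words, so only the first document matches
db._query('FOR doc IN FULLTEXT(texts, "body", "find,me") RETURN doc.body').toArray();
// expected: [ "find me please" ]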
To understand the tokenizer a little better, I've added a tiny js wrapper that invokes it.
Let's have a look at what it does to your string:
SYS_SPLIT_WORDS_ICU("ab cd", 0)
[
  "ab",
  " ",
  "cd"
]
SYS_SPLIT_WORDS_ICU("3da549f0-0e88-4297-b6af-5179b74bd929", 0)
[
  "3da549f0",
  "-",
  "0e88",
  "-",
  "4297",
  "-",
  "b6af",
  "-",
  "5179b74bd929"
]
So you see, minuses are treated as word boundaries, and your string gets split into parts. You now have several options to work around this:
- remove the minuses on insert
- split the search string and use the most meaningful part of the hash, followed by a FILTER statement on the actual value
- don't use the fulltext index at all for this, but rather a skiplist or a hash index; they're cheaper to maintain and can be used for FILTER statements (all three options are sketched below)
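To make those options concrete, here is a minimal arangosh sketch; the collection test2 and attribute id are taken from the example above, while the idFulltext attribute and the hash index are my own illustrative additions:

var raw = "3da549f0-0e88-4297-b6af-5179b74bd929";

// option 1: strip the minuses on insert, so the tokenizer keeps a single word
db.test2.save({ id: raw, idFulltext: raw.replace(/-/g, "") });
db.test2.ensureIndex({ type: "fulltext", fields: ["idFulltext"] });
db._query('FOR doc IN FULLTEXT(test2, "idFulltext", @q) RETURN doc',
          { q: "complete:" + raw.replace(/-/g, "") });

// option 2: feed the fulltext index on "id" (created above) one fragment of the hash,
// then FILTER on the exact value
db._query('FOR doc IN FULLTEXT(test2, "id", "prefix:3da549f0") FILTER doc.id == @q RETURN doc',
          { q: raw });

// option 3: skip the fulltext index entirely; a hash index serves exact matches via FILTER
db.test2.ensureIndex({ type: "hash", fields: ["id"] });
db._query('FOR doc IN test2 FILTER doc.id == @q RETURN doc', { q: raw });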

- I think you have misunderstood my point. My field is 'id' and the value is '3da549f0-0e88-4297-b6af-5179b74bd929'. So I have to index the field 'id' to search for the term '3da549f0-0e88-4297-b6af-5179b74bd929'. I have done the indexing on 'id' and can search for 'prefix:3da549f0', but the moment I search for 'prefix:3da549f0-0e88-4297-b6af-5179b74bd929' or 'complete:3da549f0-0e88-4297-b6af-5179b74bd929' it gives me no result. – Haseb Ansari Apr 21 '16 at 10:36
- One more question: if my value is "-11", how would the fulltext index work to find this value? I want to use the fulltext index rather than a hash or skiplist index. – Haseb Ansari Apr 25 '16 at 07:38
- It splits the minus from the 11. You should also note that the fulltext index is more expensive on insert and update modifications, so if you want a full string match or a range string match like `FILTER a.i >= 'Ab' && a.i < 'Az'`, the skiplist can also do that for you. – dothebart Apr 25 '16 at 07:59
- Yes, I could use a skiplist, but the problem is that there are other fields which require a fulltext index, and I don't know which field will have a minus in its value in the future. So, for uniformity, I want to use the fulltext index to handle everything, irrespective of what the value is. So how could I handle the minus with the fulltext index? – Haseb Ansari Apr 25 '16 at 08:18
- I.e. encode them on insert to `_45`, but again, I would strongly discourage using the fulltext index for something it wasn't meant to be used for, even if that means you have to use / generate different queries in some cases. – dothebart Apr 25 '16 at 08:24
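For completeness, a rough sketch of that last suggestion; the helper name is made up, and whether the encoded value survives the tokenizer (and the index's word length limits) should be verified, e.g. with the SYS_SPLIT_WORDS_ICU wrapper shown in the answer:

// hypothetical helper: replace each '-' with '_45' (the ASCII code of '-'),
// the assumption being that '_', unlike '-', does not act as a word boundary
function encodeForFulltext(value) {
  return value.replace(/-/g, "_45");
}

var raw = "3da549f0-0e88-4297-b6af-5179b74bd929";
db.test2.save({ id: raw, idEncoded: encodeForFulltext(raw) });
db.test2.ensureIndex({ type: "fulltext", fields: ["idEncoded"] });

// the same encoding has to be applied to every search term
db._query('FOR doc IN FULLTEXT(test2, "idEncoded", @q) RETURN doc',
          { q: "complete:" + encodeForFulltext(raw) });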