I'm trying to use whoosh to do text searches.

When I search for a string containing a hyphen (e.g. 'IGF-1R'), the query ends up as 'IGF' AND '1R' instead of treating the input as a single term.

Any idea why?

Here is the code I'm using:

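from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

# FuzzyTerm subclass with a default prefix length of 2
class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=1, prefixlength=2, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

# ix is an existing index whose schema has a "gene" field
with ix.searcher() as searcher:
    qp = QueryParser("gene", schema=ix.schema, termclass=MyFuzzyTerm)
    q = qp.parse('IGF-1R')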

q ends up as:

And([MyFuzzyTerm('gene', 'igf', boost=1.000000, maxdist=1, prefixlength=2), MyFuzzyTerm('gene', '1r', boost=1.000000, maxdist=1, prefixlength=2)])

I'd like it to be:

MyFuzzyTerm('gene', 'igf-1r', boost=1.000000, maxdist=1, prefixlength=2)
Assem
yoann

1 Answer

Splitting text into words is the job of the tokenizer. I usually use whoosh.analysis.SpaceSeparatedTokenizer(), which splits only on whitespace, but in your case the tokenizer is splitting on both spaces and dashes.
So I bet you are using either whoosh.analysis.CharsetTokenizer(charmap) with a charmap that treats the space and the dash as separators, or whoosh.analysis.RegexTokenizer(gaps=False) with an expression that does not match the dash.
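
To see the difference between the two tokenizing strategies, here is a minimal sketch using only the standard library. It assumes that Whoosh's default RegexTokenizer expression is \w+(\.?\w+)* and that SpaceSeparatedTokenizer matches runs of non-whitespace characters. The tokenize helper is hypothetical and only imitates tokenizing followed by lowercasing; it is not Whoosh code.

```python
import re

# Assumed patterns, reproduced with the standard library so the sketch
# runs without Whoosh or an index.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")  # assumed RegexTokenizer default
SPACE_PATTERN = re.compile(r"[^ \t\r\n]+")     # assumed SpaceSeparatedTokenizer pattern


def tokenize(pattern, text):
    """Hypothetical helper: each non-overlapping match becomes a lowercased token."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


# The dash is not a word character, so the default pattern stops at it.
default_tokens = tokenize(DEFAULT_PATTERN, "IGF-1R")
# The whitespace-only pattern keeps the dash inside the token.
space_tokens = tokenize(SPACE_PATTERN, "IGF-1R")

print(default_tokens)  # ['igf', '1r']
print(space_tokens)    # ['igf-1r']
```

In Whoosh itself, the corresponding change would presumably be to declare the field as TEXT(analyzer=SpaceSeparatedTokenizer() | LowercaseFilter()) and rebuild the index, since the field's analyzer is applied to the documents at indexing time as well as to the query string.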

ismnoiet
Assem