I am using Whoosh to index over 200,000 books, but I have run into some problems. The Whoosh query parser returns NullQuery for words with meta-characters in them, such as "C#" and "C++", and also for some other short words. These words appear in the title and body of some documents, so I am not using the keyword type for them. I guess the problem is in the analysis or query-parsing phase of indexing or searching, but I don't want to change my data blindly. Can anyone help me correct this issue? Thanks.

I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the pattern:

'\w+[#+.\w]*'

With this pattern the fields are tokenized successfully, and searching works as well. But when I use queries like "some query++*" or "some##*", the parsed query is a single Every query, just the '*'. I also found that this is not related to my analyzer; it is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?

Note: removing the WildcardPlugin from the query parser solves this problem, but I also need the WildcardPlugin.
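To see what the pattern above accepts, here is a minimal sketch that uses only the standard `re` module. It assumes the tokenizer takes every regex match as one token (as far as I know that is how Whoosh's regex tokenizer behaves when `gaps` is left at its default); the helper name `tokens` is made up for this illustration.

```python
import re

# Pattern from the update above: a word, optionally followed by
# '#', '+', '.' or more word characters.
custom_pattern = re.compile(r"\w+[#+.\w]*")

def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

print(tokens(custom_pattern, "C# and C++ books"))
# ['C#', 'and', 'C++', 'books']

print(tokens(custom_pattern, "I like .NET and C++."))
# ['I', 'like', 'NET', 'and', 'C++.']
```

The second call shows two limits of this pattern: a leading dot is dropped ('.NET' becomes 'NET') and a sentence-ending period sticks to the token ('C++.').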


Now I am using the following code:

from whoosh import analysis
from whoosh.util import rcompile
# Match words like '.NET', 'C++' and 'C#'
word_pattern = rcompile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I keep the default minsize
analyzer = analysis.StandardAnalyzer(expression=word_pattern)

... now in my schema:

...
title = fields.TEXT(analyzer=analyzer),
...

Yes, this solves my first problem. But the main problem is in searching. I don't want to let users search with the Every query, i.e. '*'. But when I parse a query like C++*, I end up with an Every(*) query. I know something is wrong, but I can't figure out what it is.
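One possible workaround, sketched here under assumptions, is to inspect the raw query string before it reaches the parser. The helper `split_prefix_terms` is hypothetical (it is not part of Whoosh); it only handles whitespace-separated words and ignores phrases, field prefixes and boolean operators.

```python
def split_prefix_terms(user_query):
    """Split a raw query on whitespace into (text, is_prefix) pairs.

    A trailing '*' marks a prefix search. A word made only of '*' is
    rejected, so a match-everything query can never be built from it.
    """
    pairs = []
    for word in user_query.split():
        is_prefix = word.endswith("*")
        # Lowercase to mirror the analyzer's lowercasing (an assumption).
        text = word.rstrip("*").lower()
        if not text:
            raise ValueError("a bare '*' is not allowed")
        pairs.append((text, is_prefix))
    return pairs

print(split_prefix_terms("some C++*"))
# [('some', False), ('c++', True)]
```

Each pair could then be turned into a `whoosh.query.Prefix` or `whoosh.query.Term` object by hand and combined with `whoosh.query.And`, which bypasses the parser's wildcard handling for those words. That last step is not tested here.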

ChrisF
1 Answer

I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.

import whoosh.analysis
import whoosh.fields

schema = whoosh.fields.Schema(
  name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
  # ...
)
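To illustrate why `minsize` matters, here is a rough pure-Python stand-in for the analyzer chain (tokenize, lowercase, drop short tokens). It is a simulation, not Whoosh itself: the regex is assumed to match Whoosh's default tokenizer expression, stop-word removal is left out, and `simulate_standard_analyzer` is a made-up name.

```python
import re

# Assumed to be equivalent to Whoosh's default tokenizer expression.
default_pattern = re.compile(r"\w+(\.?\w+)*")

def simulate_standard_analyzer(text, minsize=2):
    """Tokenize, lowercase, then drop tokens shorter than minsize."""
    words = [m.group(0).lower() for m in default_pattern.finditer(text)]
    return [w for w in words if len(w) >= minsize]

print(simulate_standard_analyzer("C#"))             # []
print(simulate_standard_analyzer("C#", minsize=1))  # ['c']
```

With the default settings "C#" is reduced to the single character "c" and then discarded, which would leave the parser with nothing to search for. With `minsize=1` the token survives, but "C#" and "C++" both collapse to "c", so a custom `expression` is still needed to tell them apart.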
kichik
  • Thanks for your answer, kichik. Yes, you are right: by changing the minsize and expression parameters of `StandardAnalyzer` we can change which tokens are accepted for indexing. But I have changed my question. – Mohsen Mahmoodi May 04 '13 at 20:32
  • Hmm... Now how would someone be able to find out the solution to the original problem? It would have been nicer if you opened a new question instead of completely changing this one. – kichik May 04 '13 at 23:02