I'm building a system that looks through articles on various topics and extracts a short description of each one, much like an encyclopaedia. At first I ran into a problem where searching for "cat" returned a lot of hits for articles like "CAT5", "CAT6", ".cat" and so on, although the number one hit was still "Cat". I was using StandardAnalyzer for this. I received a tip to use WhitespaceAnalyzer instead, which solved the original problem and made Lucene drop the hits on articles like CAT6, but now the article "Cat" is no longer in my list of hits at all. Why is this? Any suggestions, for example a different analyzer?

EDIT: The code for the search itself:

public static String searchAbstracts(String input, int hitsPerPage) throws ParseException, IOException {
    String query = input;
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
    Query q = new QueryParser(Version.LUCENE_41, "article", analyzer).parse(query);
    Directory index = new NIOFSDirectory(new File(INDEX_PATH));
    IndexReader reader = IndexReader.open(index);
    String resultSet = "";

    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " articles.");

    for(int i=0;i<hits.length;++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        resultSet += d.get("desc") + " ";
        System.out.println((i + 1) + ". " + d.get("article") + " :: Words from abstract: " + d.get("desc"));
    }
    return resultSet;
}
Geir K.H.

1 Answer

When you run the sentence "The quick Cat jumped over the lazy CAT6" through WhitespaceAnalyzer, it produces these tokens:
[The] [quick] [Cat] [jumped] [over] [the] [lazy] [CAT6]

As you can see, "Cat" is in the list with its original case, so you should be able to find it. How are you querying it? Which analyzer are you using at query time?
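
The query-time analyzer matters because terms are compared exactly, including case. Below is a minimal sketch in plain Java, not Lucene itself: `whitespaceTokens` is a hypothetical stand-in for what WhitespaceAnalyzer does to a field (split on whitespace, keep case), and `search` is a made-up helper that mimics an exact term lookup. It assumes the query term has already been analyzed, so "cat" stands for what a lowercasing analyzer would produce and "Cat" for what a case-preserving one would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnalyzerMismatchSketch {

    // Hypothetical stand-in for WhitespaceAnalyzer: split on whitespace, keep case.
    public static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Returns the titles whose whitespace tokens contain the query term exactly.
    // The comparison is case-sensitive, like a lookup of an already-analyzed term.
    public static List<String> search(List<String> titles, String analyzedQueryTerm) {
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (whitespaceTokens(title).contains(analyzedQueryTerm)) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Cat", "phantom cat", "CAT6");

        // Term as a lowercasing analyzer would emit it: misses the article "Cat".
        System.out.println(search(titles, "cat")); // prints [phantom cat]

        // Term with its case preserved: finds the article "Cat".
        System.out.println(search(titles, "Cat")); // prints [Cat]
    }
}
```

If the index was built with a case-preserving analyzer and the query goes through a lowercasing one, only titles that already contain the lowercase term can match, which is why the same analyzer is normally used on both sides.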

Arun
  • I edited my post to include the code that performs the search. As you can see, I'm using WhitespaceAnalyzer for the query as well. (Every index entry contains two fields, article and desc, where article is the name of the article and desc is the abstract from that article.) When searching I get hits like "Schrödinger's cat", "phantom cat", "maltese cat" and so on, but the article "cat" is somehow missing. It's also kinda hard to look through the index manually with, for example, Luke, as there are 3.74 million entries in the index. – Geir K.H. Dec 06 '13 at 13:44
  • You are using StandardAnalyzer for the query, which processes the text differently from WhitespaceAnalyzer. Here is how these analyzers process the example phrase "The quick Cat jumped over the lazy CAT6":
    WhitespaceAnalyzer: [The] [quick] [Cat] [jumped] [over] [the] [lazy] [CAT6]
    SimpleAnalyzer: [the] [quick] [cat] [jumped] [over] [the] [lazy] [cat]
    StopAnalyzer: [quick] [cat] [jumped] [over] [lazy] [cat]
    StandardAnalyzer: [quick] [cat] [jumped] [over] [lazy] [cat6]
    – Arun Dec 06 '13 at 13:56
  • My mistake. I was so sure I had modified my search functions when I swapped over to WhitespaceAnalyzer, but obviously I hadn't. Everything works fine now that I've switched over, thanks! That said, I still find it peculiar that StandardAnalyzer didn't find the word cat if the word was indeed picked out by WhitespaceAnalyzer in the first place. – Geir K.H. Dec 06 '13 at 14:02