I am using Xapian as the search engine for my website.

Recently I found out that it does not find words containing Polish-specific characters such as ś, ą, ć, ę.

Whenever I search for a word containing one of those language-specific characters, it returns no results. Are there any encoding settings in Xapian?

These are my indexing and searching functions ($document has content, id and route fields).

protected function _indexDocument($document, $indexer, $database)
{
    $doc = new XapianDocument();
    $content = Zend_Json::encode($document);
    $doc->set_data($content);

    $indexer->set_document($doc);
    $indexer->index_text($content);
    $term = (string) md5($document['id']);
    $doc->add_boolean_term($term);
    $database->replace_document($term, $doc);
    return true;
}


public function searchDocuments($phrase, $page = 0, $limit = 10)
{
    $page = (int) $page;
    $limit = (int) $limit;

    $database = new XapianDatabase($this->getDatabasePath());
    $enquire = new XapianEnquire($database);
    $qp = new XapianQueryParser();
    $stemmer = new XapianStem("english");
    $qp->set_stemmer($stemmer);
    $qp->set_database($database);
    $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);
    $query = $qp->parse_query($phrase);

    $enquire->set_query($query);
    $matches = $enquire->get_mset(($page-1) * $limit, $limit);

    $documentCount = $matches->get_matches_estimated();

    $i = $matches->begin();
    $documents = array();
    $rawDocuments = array();
    while (!$i->equals($matches->end())) {
        $n = $i->get_rank() + 1;
        $data = $i->get_document()->get_data();
        $documents[] = $this->_prepareDocument( Zend_Json::decode($data), $phrase );
        $rawDocuments[]= Zend_Json::decode($data);
        $i->next();
    }

    $pageCount = ceil($documentCount / $limit);
    if ($page > 0) {
        $prevPage = ($page - 1) * $limit;
    } else {
        $prevPage = 0;
    }
    if ($page < $pageCount) {
        $nextPage = ($page + 1) * $limit;
    } else {
        $nextPage = $pageCount;
    }
    $result = array('results' => $documents, 'results-raw' => $rawDocuments, 'paginator' => array(
            'page' => $page, 'limit' => $limit, 'pageCount' => $pageCount,
            'prevPage' => $prevPage, 'nextPage' => $nextPage,
            'documentCount' => $documentCount));
    return $result;

}
Somal Somalski
  • Xapian is at heart a binary system, so it doesn't really care what you put through it. The QueryParser (and TermGenerator) do, but they can absolutely handle Unicode characters. If you can package this up as a more minimal code example, the mailing list will be able to help you dig through what's going on. (Putting together a very brief version of what you've done in Python worked fine for me, so it's unlikely to be anything to do with Xapian. Is there a reason why you use `$page-1`, incidentally? That's negative by default in your code.) – James Aylett Feb 27 '13 at 11:08
  • That $page is always initialized to 0. The default parameter was my mistake, but it does not matter here. The problem in my case was using JSON. I replaced JSON with serialize and it works well now. – Somal Somalski Feb 27 '13 at 11:47
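
A likely explanation for why swapping JSON for serialize helped (an assumption, the thread does not confirm it): PHP's `json_encode()`, which `Zend_Json::encode` normally delegates to, escapes non-ASCII characters as `\uXXXX` by default, so the string passed to `index_text()` no longer contains the literal Polish letters. The sketch below uses plain `json_encode()` and a made-up sample document to show the difference. It does not touch Xapian itself.

```php
<?php
// Hypothetical sample document; the field names follow the question.
$document = array('id' => 1, 'route' => '/przyklad', 'content' => 'żółć gęś');

// Default encoding escapes non-ASCII characters as \uXXXX sequences.
$escaped = json_encode($document);

// JSON_UNESCAPED_UNICODE (PHP 5.4+) keeps the UTF-8 bytes intact.
$literal = json_encode($document, JSON_UNESCAPED_UNICODE);

// serialize() also leaves the UTF-8 bytes untouched.
$serialized = serialize($document);

// Check whether the literal word survives in each representation.
var_dump(strpos($escaped, 'gęś') !== false);    // bool(false)
var_dump(strpos($literal, 'gęś') !== false);    // bool(true)
var_dump(strpos($serialized, 'gęś') !== false); // bool(true)
```

If this is indeed the cause, passing only `$document['content']` to `index_text()` and keeping the encoded form for `set_data()` would decouple indexing from the storage format.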
