
This question has been asked before:

Postgresql full text search in postgresql - japanese, chinese, arabic

but there are no answers for Chinese as far as I can see. I took a look at the OpenOffice wiki, and it doesn't have a dictionary for Chinese.

Edit: As we are already successfully using PG's internal FTS engine for English documents, we don't want to move to an external indexing engine. Basically, what I'm looking for is a Chinese FTS configuration, including parser and dictionaries for Simplified Chinese (Mandarin).

Mike Chamberlain
  • As we were unable to find a solution for this (even with the bounty I offered) we eventually moved to SQL Server, which natively supports Chinese FTS. Luckily our application was designed to be fairly DB vendor agnostic, so this wasn't a huge problem for us. – Mike Chamberlain Dec 20 '10 at 10:35

3 Answers


I know it's an old question but there's a Postgres extension for Chinese: https://github.com/amutu/zhparser/

Rui Pacheco
  • I'm getting `text-search query contains only stop words or doesn't contain lexemes, ignored` issues. See https://stackoverflow.com/questions/41659909/fts-non-latin-text-search-query-contains-only-stop-words-or-doesnt-contain-lex – user3871 Jan 17 '17 at 15:33
  • @Growler page not found. – Weihang Jian Jan 22 '20 at 02:06

I've just implemented a Chinese FTS solution in PostgreSQL. I did it by generating n-gram tokens from the Chinese input and building the necessary tsvectors with an embedded function (in my case, plpythonu). It works very well (and was massively preferable to moving to SQL Server!).
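The answer doesn't include code, but the bigram (n = 2) variant of the approach can be sketched in plain Python; the same logic would sit inside a plpythonu function whose output is fed to `to_tsvector`. The helper name `cjk_bigrams` and the use of the `simple` configuration are illustrative assumptions, not the answerer's actual implementation.

```python
def cjk_bigrams(text):
    """Split a string into overlapping character bigrams.

    Chinese has no whitespace between words, so indexing every
    adjacent character pair lets tsvectors built with the 'simple'
    configuration match substrings without a dictionary-based
    word segmenter.
    """
    # Keep single characters as-is so one-character queries still match.
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Joining the tokens with spaces yields a string that PostgreSQL's
# built-in parser can tokenize, e.g. (hypothetical usage):
#   to_tsvector('simple', ' '.join(cjk_bigrams(doc)))
print(cjk_bigrams("全文搜索"))  # → ['全文', '文搜', '搜索']
```

The same bigramming must be applied to query strings before building the tsquery, so that both sides of the match use identical tokens.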

simon

Index your data with Solr, an open source enterprise search server built on top of Lucene.

You can find more info on Solr here:

http://lucene.apache.org/solr/

A good how-to book (with an immediate PDF download) is here:

https://www.packtpub.com/solr-1-4-enterprise-search-server/book

And be sure to use a Chinese tokenizer, such as solr.ChineseTokenizerFactory, because Chinese is not whitespace-delimited.

Chris Adragna
  • We need to use the FTS engine built into Postgres. We have already successfully implemented English FTS, and want to continue to use the same system for Chinese documents. – Mike Chamberlain Oct 24 '10 at 23:10
  • 1
    Oh, I see. Well, then my answer isn't helpful to you. I see your clarification/edit on the question since your original post. I'm not sure what your timeline will accomodate, but the Solr solutions are open source. You *may* be able to borrow from the ChineseTokenizerFactory -- it's logic overcomes the inherent problem as I understand it to be, that the language is not whitespace delimeted. Best of luck to you. – Chris Adragna Oct 25 '10 at 14:14