5

In SQLite I:

  1. Perform CREATE VIRTUAL TABLE MyTable USING fts3(tokenize=icu, id text, subject text, abstract text)
  2. Then successfully INSERT INTO MyTable (id, subject, abstract) VALUES (?, ?, ?), so I have the row: 今天天气不错fmowomrogmeog, wfomgomrg, 我是谁erz

When I perform SELECT id FROM MyTable WHERE MyTable MATCH 'z*', it does not return anything; whenever I search a single letter it returns nothing. However, if I search 'm' or '天气' or '天', it works.

I know SQLite FTS only supports prefix search, so I am using ICU. Am I making a mistake?
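For reference, here is the scenario condensed into plain SQL (a minimal sketch, assuming a SQLite build with FTS3 and the ICU tokenizer enabled; the literal values are the ones from the steps above):

CREATE VIRTUAL TABLE MyTable USING fts3(tokenize=icu, id text, subject text, abstract text);
INSERT INTO MyTable (id, subject, abstract) VALUES ('今天天气不错fmowomrogmeog', 'wfomgomrg', '我是谁erz');

SELECT id FROM MyTable WHERE MyTable MATCH '天气';  -- returns the row
SELECT id FROM MyTable WHERE MyTable MATCH 'z*';    -- returns nothing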

Note: I've looked at the source code of Foxmail, and it looks to me like it can search ',', 'f', and so on.

andygavin

2 Answers

5

Try Hai Feng Kao's character tokenizer. It can search by prefix, postfix, and anything in between, and it supports Chinese as well. I don't think you can find any other tokenizer that supports arbitrary substring search.

BTW, this is shameless self-promotion.

If you want to open a database indexed with the character tokenizer in Objective-C (via FMDB), do the following:

#import <FMDB/FMDatabase.h>
#import "character_tokenizer.h"

FMDatabase* database = [[FMDatabase alloc] initWithPath:@"my_database.db"];
if ([database open]) {
    // add FTS support: fetch the character tokenizer module and register it
    // with this connection under the name "character", so FTS tables can be
    // declared with tokenize=character
    const sqlite3_tokenizer_module *ptr;
    get_character_tokenizer_module(&ptr);
    registerTokenizer(database.sqliteHandle, "character", ptr);
}
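Once the tokenizer is registered under the name "character", the FTS table itself is created and queried with ordinary SQL, along the lines of the sketch below (it reuses the table from the question and assumes the tokenizer indexes individual characters, which is what makes substring matches possible):

CREATE VIRTUAL TABLE MyTable USING fts3(tokenize=character, id text, subject text, abstract text);
INSERT INTO MyTable (id, subject, abstract) VALUES ('今天天气不错fmowomrogmeog', 'wfomgomrg', '我是谁erz');

-- single characters are indexed as tokens, so the searches from the question can now match:
SELECT id FROM MyTable WHERE MyTable MATCH 'z';
SELECT id FROM MyTable WHERE MyTable MATCH 'f';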
Hai Feng Kao
  • Thanks, I will try it. My app needs to support li18n tokenize; can it support that? And Foxmail can support things like 'f' or 'c' in 'abcdef'. – user1243169 Aug 23 '13 at 04:24
  • I don't know what "li18n" is, but my tokenizer can match 'f' or 'c' in 'abcdef'. I am pretty sure that the ICU tokenizer doesn't support substring search. – Hai Feng Kao Aug 23 '13 at 05:26
  • I have tried your sqlite; it is great and powerful, and I find that it does support i18n. I want to confirm: can your sqlite be used on iOS 4.3? – user1243169 Aug 23 '13 at 09:34
  • The tokenizer is written in C. It should work on all iOS versions. However, you have to check whether your iOS 4.3's sqlite supports FTS3 or not. – Hai Feng Kao Aug 23 '13 at 09:43
  • It doesn't support it. How do I compile sqlite with your tokenizer? I am not familiar with the C language. – user1243169 Aug 23 '13 at 09:53
  • The tokenizer and sqlite are separate. The demo project illustrates how to register the tokenizer with sqlite. The tokenizer should be compiled with your app. To compile a custom sqlite, check [compiling custom sqlite](http://longweekendmobile.com/2010/06/16/sqlite-full-text-search-for-iphone-ipadyour-own-sqlite-for-iphone-and-ipad/). – Hai Feng Kao Aug 23 '13 at 10:02
  • Thanks, I can compile sqlite with your tokenizer on iOS 4.3. Does your tokenizer only support en, zh, and jap? What about others? – user1243169 Aug 23 '13 at 13:34
  • I have never tested it on other languages, sorry – Hai Feng Kao Aug 23 '13 at 15:58
  • Hai, I'd like to use your tokenizer on my app, but I can't figure out how to use it with FMDB. Would you have an example? Thank you! – neowinston Nov 25 '15 at 23:45
  • @Winston Here you go! – Hai Feng Kao Nov 26 '15 at 06:36
  • Thanks a lot, will try and let you know the results! – neowinston Nov 26 '15 at 20:47
  • Hai, you saved the day! Your sample code worked like a charm. I thank you very much, my friend! – neowinston Nov 26 '15 at 21:08
3

You may also try FMDB's FMSimpleTokenizer. FMSimpleTokenizer uses the built-in CFStringTokenizer, and according to Apple's documentation, "CFStringTokenizer allows you to tokenize strings into words, sentences or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces."

If you check the FMSimpleTokenizer code, you will find that this is done by calling CFStringTokenizerAdvanceToNextToken and CFStringTokenizerGetCurrentTokenRange.

One interesting "fact" is how CFStringTokenizer tokenizes Chinese words: for example, "欢迎使用" will be tokenized into "欢迎" and "使用", which totally makes sense, but if you then search for "迎", you will be surprised to see no results at all!
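As a concrete illustration of that behaviour (a sketch; "fmdb_simple" is just a placeholder for whatever name you registered FMSimpleTokenizer under):

CREATE VIRTUAL TABLE Notes USING fts3(tokenize=fmdb_simple, body text);
INSERT INTO Notes (body) VALUES ('欢迎使用');

SELECT * FROM Notes WHERE Notes MATCH '欢迎'; -- matches, "欢迎" was indexed as a whole token
SELECT * FROM Notes WHERE Notes MATCH '迎';   -- no result, no indexed token starts with "迎"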

In that case you probably need to write a tokenizer like Hai Feng Kao's sqlite tokenizer.

Qiulang