
I'm interested in modifying the break iterator data (zh) while my program is running, as the user adds new words. This means the data cannot be packaged in advance and must be generated on the fly. Can I use something like udata_setAppData or udata_setCommonData to achieve this? I expect the .dat for the break iterator to change 2-3 times a day, so loading time should not be a critical issue.

Here's the pseudocode:

1. Start program
2. Generate .dat-like data from the database for break iterators
3. Load it into ICU as the zh break iterator

If the user makes a change to the database:

4. Drop the current .dat for the zh break iterator
5. Regenerate the .dat-like data
6. Reload

Is this possible? I think it is almost possible if I have a way of replacing U_ICUDAT_BRKITR on the fly.

Update: It seems that to pull this off, I must use code from gencmn to generate the new .dat file.

tofutim

1 Answer


There is no API to customize the dictionary.

Steven R. Loomis
  • you gotta come through for me. :) How about generating a cjdict on the fly, packing it into a .dat file in memory, and using udata_setAppData. Could that work? – tofutim Sep 06 '13 at 00:16
  • Hey. What's the actual use case? Actually, see the code for gendict - I think you could plug the output in memory. Takes a while to build (under a minute). Just a Q of what the use for it as an API. – Steven R. Loomis Sep 06 '13 at 05:53
  • I want to break sentences down into words but according to dicts that the user selects. Also, the user may add or subtract words from dicts. The target locale is 'zh'. It seems that cjdict actually never goes through gendict, so I think I just need to pack my version of cjdict.txt renamed to cjdict.dict into a .dat memory structure and refer to it from udata_setAppData. – tofutim Sep 06 '13 at 15:18
  • It does look like I need to write gendict, then make my .dat, then udata_setAppData. Is that right? – tofutim Sep 06 '13 at 16:28
  • I see " $(INVOKE) $(TOOLBINDIR)/gendict --uchars -c -i $(BUILDDIR) $(BRKSRCDIR)/$(*F).txt $@" in Makefile.in in data. Does this also process cjdict.txt? – tofutim Sep 06 '13 at 16:50
  • Inside gendict there is a comment for DataDict that "may want to put this somewhere in ICU, as it could be useful outside". I vote yes for this. The only thing that needs to be stripped is 'usageAndDie'. – tofutim Sep 06 '13 at 17:30
  • Steven, I got it to work! I switched to udata_setCommonData and had to map the basename to icudt50l/brkitr/cjdict.dict. – tofutim Sep 07 '13 at 00:45
  • I should say... it almost works. It only works the first time. If I setCommonData a second time, it does not overwrite the first. – tofutim Sep 07 '13 at 05:16
  • @tofutim this is probably more of an enhancement request than a SO discussion. You're probably building with prebuilt data that doesn't need to recompile the .dat. replace icu/source/data with the *data.zip, or d/l from subversion, and gendict will run. Think your approach could leak memory as you keep opening new versions. Please file a bug report. – Steven R. Loomis Sep 09 '13 at 22:51
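
The regenerate-and-reload loop from the question (steps 2-3 and 5-6), together with the memory concern raised in the last comment, can be sketched roughly as follows. This is only an illustration under stated assumptions, not ICU code: `DictionaryReloader`, `Blob`, `Generator` and `Installer` are hypothetical names. The generator stands in for whatever builds the .dat-like bytes (in the thread, code adapted from gendict and gencmn), and the installer stands in for a wrapper around a call such as udata_setCommonData. The sketch assumes the library keeps pointing at the memory it was given, so every installed buffer is retained instead of freed. It also ignores any alignment requirements that real ICU data may have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Bytes of one generated ".dat-like" package.
using Blob = std::vector<std::uint8_t>;

// Hypothetical helper, not an ICU class: regenerates dictionary data and
// hands it to an installer, keeping every installed buffer alive.
class DictionaryReloader {
public:
    // Assumption: builds the package bytes from the user's word list.
    using Generator = std::function<Blob(const std::vector<std::string>&)>;
    // Assumption: passes the bytes to the library and reports success,
    // for example a wrapper around udata_setCommonData(data, &status).
    using Installer = std::function<bool(const void*, std::size_t)>;

    DictionaryReloader(Generator gen, Installer inst)
        : generate_(std::move(gen)), install_(std::move(inst)) {
        assert(generate_ && install_);  // both callbacks must be set
    }

    // Steps 2-3 and 5-6 of the question: regenerate, then load.
    bool reload(const std::vector<std::string>& words) {
        std::unique_ptr<Blob> blob(new Blob(generate_(words)));
        if (!install_(blob->data(), blob->size())) {
            return false;  // rejected by the installer; blob is freed here
        }
        // Assumed: the library may still reference this memory, so it is
        // retained rather than freed. Repeated reloads therefore grow memory.
        retained_.push_back(std::move(blob));
        return true;
    }

    // Number of successfully installed data generations.
    std::size_t generations() const { return retained_.size(); }

    // Total bytes held alive across all installed generations.
    std::size_t retainedBytes() const {
        std::size_t total = 0;
        for (const auto& b : retained_) total += b->size();
        return total;
    }

private:
    Generator generate_;
    Installer install_;
    std::vector<std::unique_ptr<Blob>> retained_;
};
```

With 2-3 reloads a day the retained buffers grow slowly, but they do grow without bound. The sketch also says nothing about whether a second install actually replaces the first inside the library, which is the behaviour reported as not working in the comments above.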