
I am looking to utilize the ICU library for transliteration, but I would like to provide a custom transliteration file for a set of specific custom transliterations, to be incorporated into the ICU core at compile time for use in binary form elsewhere. I am working with the source of ICU 4.2 for compatibility reasons.

As I understand it, from the ICU Data page of their website, one way of going about this is to create the file trnslocal.mk within ICUHOME/source/data/translit/ , and within this file have the single line TRANSLIT_SOURCE_LOCAL=custom.txt.

For the custom.txt file itself, I used the following format, based on the master file root.txt:

custom{
    RuleBasedTransliteratorIDs {
        Kanji-Romaji {
            file {
                resource:process(transliterator){"custom/Kanji_Romaji.txt"}
                direction{"FORWARD"}
            }
        }
    }
    TransliteratorNamePattern {
        // Format for the display name of a Transliterator.
        // This is the language-neutral form of this resource.
        "{0,choice,0#|1#{1}|2#{1}-{2}}" // Display name
    }
    // Transliterator display names
    // This is the English form of this resource.
    "%Translit%Hex"         { "%Translit%Hex" }
    "%Translit%UnicodeName" { "%Translit%UnicodeName" }
    "%Translit%UnicodeChar" { "%Translit%UnicodeChar" }
    TransliterateLATIN{        
        "",
        ""
    }
}

I then store the file Kanji_Romaji.txt, as found here, within the directory custom. Because it uses > instead of the → I have seen in other files, I converted each entry accordingly, so they now look like:

丁 → Tei ;
七 → Shichi ;

When I compile the ICU project, I am presented with no errors.

When I attempt to utilize this custom transliterator within a testfile, however (a testfile that works fine with the in-built transliterators), I am met with the error error: 65569:U_INVALID_ID.

I am using the following code to construct the transliterator and output the error:

UErrorCode status = U_ZERO_ERROR;
Transliterator *K_R = Transliterator::createInstance("Kanji-Romaji", UTRANS_FORWARD, status);
if (U_FAILURE(status))
{
    std::cout << "error: " << status << ":" << u_errorName(status) << std::endl;
    return 0;
}

Additionally, looping from 0 to Transliterator::countAvailableIDs() and calling Transliterator::getAvailableID(i) does not list my custom transliterator. I remember reading that custom converters must be registered within /source/data/mappings/convrtrs.txt. Is there a similar file for transliterators?

It seems that my custom transliterator is either not being built into the appropriate packages (though there are no compile errors), is improperly formatted, or somehow not being registered for use. Incidentally, I am aware of the RuleBasedTransliterator route at runtime, but I would prefer to be able to compile the custom transliterations for use in any produced binary.

Let me know if any additional clarification is necessary. I know there is at least one ICU programmer on here, who has been quite helpful in other posts I have written and seen elsewhere as well. I would appreciate any help I can find. Thank you in advance!

Comic Sans MS Lover
NatHillard
  • A potentially less-sustainable, temporary workaround is to simply add the lines Kanji-Romaji {...} directly to the root.txt file. I noticed, though, that if I change the root element of my trnslocal file from "custom" to "root" (mimicking root.txt), it sort of works, but doesn't contain all of the transliterators within root.txt. If I modify the trnslocal.mk file to read `TRANSLIT_SOURCE_LOCAL=custom.txt root.txt en.txt el.txt`, there is an error at compile time, presumably because both custom.txt and root.txt now have a root element named "root" – NatHillard Jun 08 '11 at 17:33
  • Can you expand on "ICU 4.2 for compatibility reasons"? – Steven R. Loomis Jun 09 '11 at 19:05
  • This is somewhat complicated - our build environment is vs2005, because our partners work with this platform. As far as I can tell, the last release of icu built with vs2005 was 4.0, but it is possible to convert sln files from vs2008 to vs2005, so we went with 4.2 (though 4.4 is also vs2008). It doesn't seem to be possible to convert vs2010 sln or projx files to vs2005 without manually importing the source and dependencies, because of the new solution and project formats in vs2010. We could use the binaries, but given our custom transliterators and potential dependency issues, source is better – NatHillard Jun 10 '11 at 15:22
  • This is a common misconception; we need to figure out how to explain it in a better way. Don't stay on an old ICU because of the compiler. The vs2010 project files are just for our development. You can use cygwin with 2005 and it should build, or I think you can use 2010 itself to retarget the output for 2005. – Steven R. Loomis Jun 10 '11 at 16:02
  • @Nat Off topic, but you should really consider upgrading from such an old compiler. All sorts of supposed bugs we run into are really bugs in compilers, especially optimizers. VS2010 express (32-bit only) is a free download. – Steven R. Loomis Jun 10 '11 at 16:47
  • I completely agree re: compiler upgrade. I noticed significant speed improvements in ICU using 4.8 with vs2010 (which we have in house), but I have come across speed improvements in other programs as well using the vs2010 compiler. I will talk this over with my supervisor who makes the build decisions. Using multi-targeting with VS2010 seems to be a good temporary, halfway solution, as it allows for the latest ICU, but ideally we would upgrade the compiler also. – NatHillard Jun 10 '11 at 17:13

1 Answer


Transliterators are sourced from CLDR - you could add your transliterator to CLDR (the crosswire directory contains it in XML format in the cldr/ directory) and rebuild ICU data. ICU doesn't have a simple mechanism for adding transliterators as you are trying to do. What I would do is forget about trnslocal.mk or custom.txt as you don't need to add any files, and simply modify root.txt - you might file a bug if you have a suggested improvement.
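
As a rough sketch of the root.txt edit, and not a verified recipe: the custom entry from the question would go inside the existing RuleBasedTransliteratorIDs table of source/data/translit/root.txt. This assumes Kanji_Romaji.txt has been copied next to the other rule files in that directory (an assumption; keep the custom/ prefix from the question if the file stays in a subdirectory), and the comments stand in for the entries that ship with ICU, which stay untouched.

```
root {
    RuleBasedTransliteratorIDs {
        // ... existing entries shipped with ICU stay as they are ...

        // Custom entry, same shape as in the question's custom.txt.
        // The file location is an assumption, see above.
        Kanji-Romaji {
            file {
                resource:process(transliterator){"Kanji_Romaji.txt"}
                direction{"FORWARD"}
            }
        }
    }
    // ... rest of root.txt unchanged ...
}
```

After rebuilding the ICU data, the ID should show up through Transliterator::getAvailableID(i) if the entry was picked up.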

Steven R. Loomis
  • Thank you once again for a helpful answer! I should say that I have found ICU to really be the best package of its kind, and I appreciate the work you all put into it. I might file a bug report about trnslocal, because it is mentioned briefly in the documentation, but the root.txt solution works well for us. – NatHillard Jun 10 '11 at 15:27
  • Welcome. The problem isn't trnslocal: trnslocal works great for adding additional locales, that is, additional translations of the NAMES of transliterators. The problem is that the process for adding new transliterators is both undocumented and cumbersome. They need to be stored within the root bundle. CLDR's transliterator process, however, is well documented, as is the generation of ICU data from CLDR. So you are hand-editing files that we don't normally modify. – Steven R. Loomis Jun 10 '11 at 16:44