
Is it possible to obtain proper title capitalization for, e.g., English text using ICU4C without building a custom set of non-capitalized words? Say, given "pining for the fjords", I'd like to obtain "Pining for the Fjords".

With ucasemap_utf8ToTitle() and UnicodeString::toTitle() I get "Pining For The Fjords", no matter which BreakIterator or locale I use.

gagolews
  • This is too language dependent to generalize (you need a list of stop words, such as articles and prepositions). It may also be context dependent: "I Have Seen the Departed on Television". Finally, which words to capitalize and which not is a matter of preference. – Jongware Apr 19 '14 at 19:40
  • I have the same impression. – gagolews Apr 19 '14 at 19:45
  • @Jongware, I have decided to iterate over each word using ICU's **BreakIterator** and compare it to my own list of stop words. – Caroline Beltran Dec 09 '20 at 17:01
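
The approach from the last comment (walk the text word by word and check each word against your own stop-word list) can be sketched roughly as follows. This is only a minimal illustration under loud assumptions: it uses plain standard C++ and handles ASCII letters only, so the word splitting and case mapping are stand-ins for what ICU's word BreakIterator and case-mapping functions would do on real Unicode text. The function name `titleCaseWithStopWords` and the contents of `kStopWords` are hypothetical; ICU does not provide such a list.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stop-word list. ICU does not ship one, so these entries are
// an assumption chosen only for illustration.
static const std::set<std::string> kStopWords = {
    "a", "an", "and", "at", "but", "by", "for", "in",
    "of", "on", "or", "the", "to"};

// Title-cases ASCII text, keeping stop words in lower case unless they are
// the first word. A "word" here is a maximal run of ASCII letters; with ICU
// the boundaries would come from a word BreakIterator instead.
std::string titleCaseWithStopWords(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    bool firstWord = true;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!std::isalpha(static_cast<unsigned char>(text[i]))) {
            out.push_back(text[i]);  // copy separators unchanged
            ++i;
            continue;
        }
        // Collect one word, lower-casing it as we go.
        std::string word;
        while (i < text.size() &&
               std::isalpha(static_cast<unsigned char>(text[i]))) {
            word.push_back(static_cast<char>(
                std::tolower(static_cast<unsigned char>(text[i]))));
            ++i;
        }
        // Capitalize the first word and every word not in the stop list.
        if (firstWord || kStopWords.count(word) == 0) {
            word[0] = static_cast<char>(
                std::toupper(static_cast<unsigned char>(word[0])));
        }
        out += word;
        firstWord = false;
    }
    return out;
}
```

With this list, `titleCaseWithStopWords("pining for the fjords")` returns "Pining for the Fjords". It does nothing about the context problem raised above: a title such as "The Departed" in the middle of a sentence would still come out as "the Departed".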

1 Answer


@Jongware should get the credit for explaining this so well. Your question might be: does ICU have a list of non-capitalized words?

But the short answer for ICU is: No.

CLDR (from which ICU gets its data) used to have "stop words" for search purposes, but they were not well maintained and were removed: http://unicode.org/cldr/trac/ticket/5204

Steven R. Loomis
  • Hello Steven, your post is a few years old now. I just ran a test with the latest release (ICU 68.1), and "stop words" are not handled by toTitle. I'm wondering if there are any C/C++ overrides or demo code available that use stop words. Thank you. – Caroline Beltran Dec 09 '20 at 16:18
  • Hi Caroline. I don't think there's any change in the situation; as noted, these were not maintained. If you have a source for maintained stop words and their usefulness, it could be revisited. – Steven R. Loomis Dec 17 '20 at 17:19