
We are trying to break Japanese sentences into words using BreakIterator, following the code in this question. That code works only for the text given in the question. When we try a different text, e.g. "速い茶色のキツネは怠惰な犬を飛び越えます", it fails to break the text into words.

What could be the issue?

antnewbee
  • The mentioned solution splits on `。`. It splits sentences. Why did you assume it would split a sentence with no periods into words? Also, what exactly is a *word* here? – Fureeish Oct 08 '20 at 08:53
  • @Fureeish are you sure that it uses punctuation to break the text into sentences and that it won't work when the provided text has no punctuation? – Om Infowave Developers Jan 13 '23 at 13:41
  • @OmInfowaveDevelopers yes, I am quite sure. – Fureeish Jan 13 '23 at 17:13

1 Answer

`BreakIterator.getSentenceInstance(Locale.JAPAN)`, as used in that question, breaks Japanese text into sentences rather than words. Japanese is normally written without spaces or other delimiters between words, so there is nothing for a sentence iterator to split on inside a sentence.
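To make the difference concrete, here is a minimal sketch that uses only `java.text.BreakIterator` from the JDK. The class name `SentenceSplitDemo` and the helper `splitSentences` are made up for illustration and do not come from the linked question. It should show that text containing `。` is cut after each `。`, while the text from this question, which has no sentence terminator, comes back as a single piece.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the original question or answer.
final class SentenceSplitDemo {

    // Collects the segments reported by the JDK sentence BreakIterator.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.JAPAN);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two sentences, each ending with the ideographic full stop.
        System.out.println(splitSentences("今日は晴れです。明日は雨です。"));
        // The text from the question: no terminator, so no sentence boundary inside it.
        System.out.println(splitSentences("速い茶色のキツネは怠惰な犬を飛び越えます"));
    }
}
```

The iterator only ever reports sentence boundaries, so no choice of input makes it return words.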

You have to use a morphological analyzer to break a sentence into words. For example, you can use a Java port of TinySegmenter.

```java
import java.util.List;
import jp.toastkid.libs.tinysegmenter.TinySegmenter;

public class Test {
    public static void main(String[] args) {
        TinySegmenter ts = TinySegmenter.getInstance();
        List<String> list = ts.segment("速い茶色のキツネは怠惰な犬を飛び越えます。");
        System.out.println(String.join(" | ", list));
        // You will get "速い | 茶色 | の | キツネ | は | 怠惰 | な | 犬 | を | 飛び越え | ます"
    }
}
```
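For comparison, the JDK also offers `BreakIterator.getWordInstance`, which needs no extra library. The sketch below prints whatever boundaries it reports for the same text; the class `WordBreakDemo` and the helper `splitWords` are hypothetical names. As far as I know, the JDK's built-in rules are not backed by a Japanese dictionary, so the pieces are unlikely to line up with real words, and the exact output may differ between Java versions, which is why none is shown here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class for comparing against a morphological analyzer.
final class WordBreakDemo {

    // Collects every segment between consecutive word boundaries.
    static List<String> splitWords(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.JAPAN);
        it.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<String> pieces = splitWords("速い茶色のキツネは怠惰な犬を飛び越えます");
        // Inspect the result by eye; it is rule-based, not a linguistic analysis.
        System.out.println(String.join(" | ", pieces));
    }
}
```

If the printed pieces are good enough for your use case, this avoids a dependency; otherwise a morphological analyzer remains the safer choice.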
SATO Yusuke
  • Isn't `怠惰な` a single word? And what about `飛び越えます`? My knowledge is very limited, but `ます` doesn't *feel* like it qualifies as an independent word. Could you please elaborate on that, perhaps strictly from the Japanese language point of view? – Fureeish Mar 09 '23 at 23:08
  • Based on what is taught in Japanese junior high school: "怠惰な" is a single word, a conjugated form of the 形容動詞 (adjectival noun) "怠惰だ". "ます" is a 助動詞 (auxiliary verb) in Japanese grammar, see: https://ja.wikipedia.org/wiki/%E5%8A%A9%E5%8B%95%E8%A9%9E_(%E5%9B%BD%E6%96%87%E6%B3%95) – SATO Yusuke Mar 10 '23 at 16:49
  • So if `怠惰な` is a single word, then the library you linked incorrectly treats it as two separate words, doesn't it? – Fureeish Mar 10 '23 at 21:38
  • @Fureeish You're exactly right. But when a word ends with "だ", you can't tell from its appearance alone whether it is an adjectival noun or a noun + 助動詞 (auxiliary verb). For example, MeCab, the most widely used morphological analyzer for Japanese, also identifies "怠惰だ" as a noun + 助動詞. See: http://www4414uj.sakura.ne.jp/Yasanichi1/unicheck/ Some ways to distinguish them are taught in junior high (e.g. checking whether the phrase still makes sense when an adverb is placed before the word), but these methods need natural-language semantic analysis. – SATO Yusuke Mar 11 '23 at 17:01
  • Makes sense. Many thanks for the in-depth explanation! – Fureeish Mar 11 '23 at 21:26