
How can I detect Chinese punctuation characters, such as 。 and 〜, in Go?

I tried a range table from the unicode package, as in the code below, but unicode.Han doesn't include those punctuation characters.

Which range table should I use for this task? (Please refrain from suggesting regex; its performance is too low for my use case.)

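// containsHan (name added for illustration) reports whether strToDetect contains a Han character.
func containsHan(strToDetect string) bool {
    for _, r := range strToDetect {
        if unicode.Is(unicode.Han, r) {
            return true
        }
    }
    return false
}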
  • 1
    https://go.dev/play/p/WR2tZa0EliF ? –  Feb 03 '22 at 13:25
  • 1
    @mh-cbon Thanks for your reply! Actually, I need a more general solution, not just a list of all the punctuation chars. https://en.wikipedia.org/wiki/Chinese_punctuation – Yunguan Ting Feb 03 '22 at 13:34
  • 2
    Seeing how you have exceptions to the character set like `~` (U+FF5E is not punctuation), you are going to need to add an additional check for those not in the list of Chinese punctuation. – JimB Feb 03 '22 at 14:03

1 Answer


Punctuation marks are scattered across several Unicode blocks, so no single script table such as Han covers them.


The Unicode® Standard
Version 14.0 – Core Specification

Chapter 6
Writing Systems and Punctuation
https://www.unicode.org/versions/latest/ch06.pdf

Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.


Here are two of your examples:

〜 Wave Dash U+301C

。 Ideographic Full Stop U+3002
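
As a quick check of these two characters, the sketch below compares the unicode.Han script table with the unicode.Punct category table. The helper name classify is made up for illustration; only unicode.Is, unicode.Han and unicode.Punct come from the standard library.

```go
package main

import (
    "fmt"
    "unicode"
)

// classify (hypothetical helper) reports whether r is in the Han script
// table and whether it is in the Unicode punctuation category (P).
func classify(r rune) (han, punct bool) {
    return unicode.Is(unicode.Han, r), unicode.Is(unicode.Punct, r)
}

func main() {
    // U+301C and U+3002 are the two examples; U+4E2D is an ordinary Han character.
    for _, r := range []rune{'\u301C', '\u3002', '\u4E2D'} {
        han, punct := classify(r)
        fmt.Printf("%U %c han=%t punct=%t\n", r, r, han, punct)
    }
}
```

Both punctuation marks should come back as punct=true and han=false, which suggests unicode.Punct (or the shorthand unicode.IsPunct) is the table to start from.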


package main

import (
    "fmt"
    "unicode"
)

func main() {
    // CJK Symbols and Punctuation Unicode block
    for r := rune('\u3000'); r <= '\u303F'; r++ {
        if unicode.IsPunct(r) {
            fmt.Printf("%[1]U\t%[1]c\n", r)
        }
    }
}

https://go.dev/play/p/WoJjM6JKTYR

U+3001  、
U+3002  。
U+3003  〃
U+3008  〈
U+3009  〉
U+300A  《
U+300B  》
U+300C  「
U+300D  」
U+300E  『
U+300F  』
U+3010  【
U+3011  】
U+3014  〔
U+3015  〕
U+3016  〖
U+3017  〗
U+3018  〘
U+3019  〙
U+301A  〚
U+301B  〛
U+301C  〜
U+301D  〝
U+301E  〞
U+301F  〟
U+3030  〰
U+303D  〽
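
Building on the listing above, one possible way to test a whole string is to combine unicode.IsPunct with a hand-written range table of blocks. This is only a sketch: the names cjkPunctBlocks, isCJKPunct and containsCJKPunct are hypothetical, and the choice of two blocks (CJK Symbols and Punctuation, plus Halfwidth and Fullwidth Forms) is an assumption about what counts as "Chinese punctuation", not a standard definition.

```go
package main

import (
    "fmt"
    "unicode"
)

// cjkPunctBlocks is a hand-written table of two Unicode blocks.
// Which blocks to include is an assumption made for this sketch.
var cjkPunctBlocks = &unicode.RangeTable{
    R16: []unicode.Range16{
        {Lo: 0x3000, Hi: 0x303F, Stride: 1}, // CJK Symbols and Punctuation
        {Lo: 0xFF00, Hi: 0xFFEF, Stride: 1}, // Halfwidth and Fullwidth Forms
    },
}

// isCJKPunct reports whether r lies in one of the blocks above
// and is in the Unicode punctuation category.
func isCJKPunct(r rune) bool {
    return unicode.Is(cjkPunctBlocks, r) && unicode.IsPunct(r)
}

// containsCJKPunct reports whether s contains at least one such rune.
func containsCJKPunct(s string) bool {
    for _, r := range s {
        if isCJKPunct(r) {
            return true
        }
    }
    return false
}

func main() {
    for _, s := range []string{"你好\uFF0C世界\u3002", "Hello, world!", "你好"} {
        fmt.Printf("%q -> %t\n", s, containsCJKPunct(s))
    }
}
```

Only the first string should report true. Note that U+FF5E is a math symbol rather than punctuation, as pointed out in the comments, so this sketch does not match it. Marks that Chinese text shares with other scripts, such as those in the General Punctuation block, are also left out and would need extra ranges.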
rocka2q