Regular expression for matching "Shift-JIS" string against given set of ranges

Question

Problem Statement :-

Let's call 0x8140～0x84BE, 0x889F～0x9872, 0x989F～0x9FFC, 0xE040～0xEAA4, 0x8740～0x879C, 0xED40～0xEEFC, 0xFA40～0xFC4B, 0xF040～0xF9FC the allowed ranges.

I want to validate whether the input String contains a kanji that is not in the above ranges.

Here are examples of input Kanji characters not in the above range with output results :-

龔 --> OK

鑫 --> OK

璐 --> Need Change

The expected result is "Need Change" for all of them. Please help.

Here is the code :-

import java.io.UnsupportedEncodingException;
import java.util.regex.*;
//import java.util.regex.Pattern;

public class RegExpDemo2 {

    private boolean validateMnpName(String name)  {

        try {
            byte[] utf8Bytes = name.getBytes("UTF-8");
            String string = new String(utf8Bytes, "UTF-8");

            byte[] shiftJisBytes = string.getBytes("Shift-JIS");
            String strName = new String(shiftJisBytes, "Shift-JIS");

            System.out.println("ShiftJIS Str name : "+strName);

            final String regex = "([\\x{8140}-\\x{84BE}]+)|([\\x{889F}-\\x{9872}]+)|([\\x{989F}-\\x{9FFC}]+)|([\\x{E040}-\\x{EAA4}]+)|([\\x{8740}-\\x{879C}]+)|([\\x{ED40}-\\x{EEFC}]+)|([\\x{FA40}-\\x{FC4B}]+)|([\\x{F040}-\\x{F9FC}]+)";

            if (Pattern.compile(regex).matcher(strName).find()) {
                return true;
            } else
                return false;
        }
        catch (Exception e) {
            e.printStackTrace();
            return false;
        }

    }

    public static void main(String args[]) {

        RegExpDemo2 obj = new RegExpDemo2();

        if (obj.validateMnpName("ロ")) {
            System.out.println("OK");
        } else {
            System.out.println("Need Change");
        }

    }
}

(1) **Remove** your first four lines of code in the try block. Strings are not bytes. Your round-trip use of bytes accomplishes nothing at all. (2) You said you want to validate if a String contains kanji from those ranges, but you then say that a katakana-only string is a valid input string, while a string which actually contains those kanji is invalid. Did you reverse the words “valid” and “invalid” in your question? — VGR, Nov 26 '20 at 14:55
Java Strings are Unicode (specifically, UTF-16). Calling `getBytes` for some other encoding, and then constructing a String from those bytes, results in a conversion from Unicode -> other encoding -> Unicode. — user14644949, Nov 26 '20 at 15:22
I've updated the information in the description. Please check and advise. — Mahesh Jadhav, Nov 26 '20 at 16:35
I’m not sure what “for all of them” means, but your code prints `Need Change` when I run it. Is that not what you want? — VGR, Nov 26 '20 at 17:19
Hi @VGR, when you pass "龔" or "鑫" instead of "ロ" in obj.validateMnpName("ロ"), the result is OK, which is incorrect. It seems the regex used to validate such kanji characters against the given ranges is not working properly. — Mahesh Jadhav, Nov 26 '20 at 19:02
Let us start with getting an answer to a fundamental question: when you write codepoint values like 0x8140～0x84BE, **what character encoding are you using**? Unicode? Shift-JIS? Something else? It makes an immense difference to the programming. — user14644949, Nov 27 '20 at 00:31
@user14644949, the character encoding would be Shift-JIS. The goal is to validate kanji characters against the ranges provided by the customer. — Mahesh Jadhav, Nov 27 '20 at 11:17
Why do you believe it is not correct for "龔" to result in a match? "龔" is '\u9f94', and you are explicitly including 9f94 in your regular expression when you specify `[\\x{989F}-\\x{9FFC}]+`. — VGR, Dec 01 '20 at 02:29

score 0 · Answer 1 · answered Nov 27 '20 at 13:42

Your approach cannot work, because a String is Unicode in Java.

As observed by @VGR and myself, a round-trip through a Shift-JIS byte array does not change that. You simply converted Unicode to Shift-JIS and back to Unicode.
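
To illustrate the point, here is a minimal sketch (the class and method names are hypothetical, not taken from the question). It shows that a String rebuilt from Shift-JIS bytes holds the same Unicode value as before, so a regex applied to it still sees Unicode code points rather than Shift-JIS codes.

```java
import java.nio.charset.Charset;

class RoundTripDemo {

    // Encode the String to bytes in the given charset, then decode those bytes again.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        String kanji = "\u4E9C"; // U+4E9C, which is encoded as 0x889F in Shift-JIS

        String back = roundTrip(kanji, sjis);

        System.out.println(back.equals(kanji));                  // true
        System.out.println(Integer.toHexString(back.charAt(0))); // 4e9c, not 889f
    }
}
```

The character whose Shift-JIS code is 0x889F is still U+4E9C inside the String, which is why a pattern such as `[\\x{889F}-\\x{9872}]` does not test what the question intends.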

There are two approaches possible:

1. Convert the Java String (which is Unicode) into an array of bytes (in Shift-JIS encoding), and then examine the byte array for the allowed/forbidden values.
2. Convert the 'allowed' ranges into Unicode (and a single range in Shift-JIS may not be a single range in Unicode) and work with the String representation in Unicode.

Neither way seems pretty, but if you have to use old character codes instead of the not-quite-so-old (only 30 years!) Unicode, this is necessary.

answered Nov 27 '20 at 13:42

user14644949


Thank you very much for your answer. I'm trying to build a solution using the approaches you provided, but it seems difficult to apply with regex. Could you provide code for this problem statement (using any approach)? I found the link "https://uic.jp/charset/show/shiftjis2004/" for testing purposes. – Mahesh Jadhav Nov 28 '20 at 18:00
I wouldn't use regex. Either way, I'd loop through the string (either as a string or as a byte array, depending on the approach chosen) examining each character (which may be multibyte for Shift-JIS) to see if it's in the desired range. – user14644949 Nov 28 '20 at 18:52
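
The loop suggested in the comment above, applied to the byte-array approach, might look like the following sketch. It is not the answerer's code. The class and method names are hypothetical, and it assumes the `windows-31j` charset (Microsoft's Shift-JIS variant), because the ranges in the question appear to include vendor extension areas; substitute whichever Shift-JIS variant the customer actually means. It also assumes that single-byte characters (ASCII, half-width katakana) and characters the charset cannot encode count as "not in range".

```java
import java.nio.charset.Charset;

class ShiftJisRangeCheck {

    // Assumption: windows-31j is the intended Shift-JIS variant.
    private static final Charset SJIS = Charset.forName("windows-31j");

    // The allowed two-byte code ranges from the question (inclusive).
    private static final int[][] ALLOWED = {
        {0x8140, 0x84BE}, {0x8740, 0x879C}, {0x889F, 0x9872}, {0x989F, 0x9FFC},
        {0xE040, 0xEAA4}, {0xED40, 0xEEFC}, {0xF040, 0xF9FC}, {0xFA40, 0xFC4B},
    };

    // True if the byte value starts a two-byte Shift-JIS sequence.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    static boolean inAllowedRange(int code) {
        for (int[] range : ALLOWED) {
            if (code >= range[0] && code <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // True only if every character encodes to a two-byte code inside an allowed range.
    static boolean allInRange(String s) {
        // Characters the charset cannot represent would silently become '?', so reject them first.
        if (!SJIS.newEncoder().canEncode(s)) {
            return false;
        }
        byte[] bytes = s.getBytes(SJIS);
        int i = 0;
        while (i < bytes.length) {
            int first = bytes[i] & 0xFF;
            if (!isLeadByte(first) || i + 1 >= bytes.length) {
                return false; // single-byte character: outside every range
            }
            int code = (first << 8) | (bytes[i + 1] & 0xFF);
            if (!inAllowedRange(code)) {
                return false;
            }
            i += 2;
        }
        return true;
    }

    public static void main(String[] args) {
        // U+30ED (katakana RO) encodes to 0x838D, which lies in 0x8140-0x84BE.
        System.out.println(allInRange("\u30ED") ? "OK" : "Need Change");
    }
}
```

Because the check runs on the encoded bytes, the comparison is against real Shift-JIS codes, which is what the ranges describe. An empty string returns true in this sketch; add an explicit check if that should be rejected.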
