I’m using Java 6 (not an option to upgrade at this time). I have a Java string that contains the following value:

My Product Edition 2014©

The last symbol is a copyright symbol (©). When this string is printed to my terminal (bash on Mac OS X 10.9.5), the copyright symbol renders as a question mark.

I’d like to know how to remove all characters from my string that will render as question marks on my terminal.

dimo414
Dave A
  • Show us the relevant part of the Java code you use. –  May 07 '15 at 20:45
  • Does http://stackoverflow.com/a/19363465/1682419 help by setting the terminal character encoding so Unicode characters print properly? – Jerry101 May 07 '15 at 20:50
  • If you want to maintain the correct characters when printing, you have to check 1) the encoding of your source .java file (normally setting it to UTF-8 will work; if you use Eclipse it's pretty easy to change it from Parameters) and 2) the encoding of your console (on a Mac this is probably UTF-8 already, but check it). – Kostas Kryptos May 07 '15 at 20:54
  • The word you're looking for is probably "non-ASCII". If there are Unicode characters you *do* want to print in addition to ASCII, please provide more examples of what you'd like to print vs. strip. – dimo414 May 07 '15 at 20:59
  • Why do you want to *remove* these characters? Wouldn't it be better to print them correctly? –  May 07 '15 at 21:04
  • This looks like an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). –  May 07 '15 at 21:16
  • Can a Java program detect the character set of the console it is running in? Isn't that a prerequisite for "remove all characters from my string that will render as question marks on my terminal"? – Tom Blodget May 07 '15 at 22:00

4 Answers

The "right" thing to do here is to fix your terminal so it doesn't print question marks. See "How do you echo a 4-digit Unicode character in Bash?" and try echoing Unicode characters directly in your terminal. It may be as simple as ensuring your LANG environment variable is set to a UTF-8 locale (on my Mac, $LANG is en_US.UTF-8). You might also consider using a more full-featured terminal, like iTerm2.

If you really want to strip non-ASCII characters in Java instead, there are a number of equally reasonable ways to do so, but my preference is Guava's CharMatcher, e.g.:

String stripped = CharMatcher.ASCII.retainFrom(original);

You could use a Pattern to strip undesirable characters, but (as demonstrated by the confusion here) it's more hassle than using Guava's out-of-the-box solution.
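If adding Guava is not an option, a JDK-only approach is possible too. The following is a minimal sketch, assuming the goal is to drop whatever a given charset cannot represent; the class and method names (EncodableFilter, retainEncodable) are made up for illustration. Passing US-ASCII approximates the ASCII-only filter above. Passing Charset.defaultCharset() instead is only a guess at what the terminal accepts, because the JVM default charset does not necessarily match the terminal's real encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodableFilter {

    /**
     * Hypothetical helper: keeps only the code points that the given
     * charset can encode and silently drops everything else.
     */
    public static String retainEncodable(String input, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Walk by code point so surrogate pairs are tested as a unit.
            int codePoint = input.codePointAt(i);
            int len = Character.charCount(codePoint);
            CharSequence piece = input.subSequence(i, i + len);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "My Product Edition 2014\u00A9";
        // With US-ASCII the copyright sign is dropped.
        System.out.println(retainEncodable(original, Charset.forName("US-ASCII")));
    }
}
```

The code sticks to APIs available in Java 6 (Charset.forName rather than StandardCharsets, which arrived in Java 7).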

dimo414

You had better adopt the notion that there is no such thing as a "special character". However, there are a couple of reasons why some characters are not shown correctly.

Java keeps all strings in UTF-16 encoding internally. When you print a string, the characters are converted to the encoding of the corresponding output stream or writer. Unfortunately, the Java runtime tries to be smart and uses what is called the "default" encoding unless you explicitly request a specific one.
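To see where the question mark comes from, here is a small sketch (the class and method names EncodingDemo and roundTrip are invented for this example). It encodes a string into a charset and decodes it again; a character the charset cannot represent is replaced by the charset's default replacement byte, which is '?' for US-ASCII. Whether this is exactly what happens on the asker's terminal is an assumption, since that depends on the default encoding the JVM picked.

```java
import java.nio.charset.Charset;

public class EncodingDemo {

    /** Encodes text in the named charset and decodes it again. */
    public static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        // getBytes(Charset) substitutes the charset's replacement byte
        // for every character it cannot map.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String s = "2014\u00A9";
        // The encoding the runtime falls back to when none is specified.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("via US-ASCII:    " + roundTrip(s, "US-ASCII"));
        System.out.println("via UTF-8:       " + roundTrip(s, "UTF-8"));
    }
}
```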

This especially hurts Windows users, where the default encoding often turns out to be some archaic Microsoft "code page". I have yet to find out where I can tell Windows that I don't want their CP 850 (which is the default whenever you have a German keyboard).

In the long run, you'll fare best when you make the following a habit:

  1. Open all your output streams (or writers) with UTF-8 encoding. Don't use System.out/System.err.
  2. Make sure you use a terminal that can handle UTF-8. If you're on Windows, enter chcp 65001 to set the encoding of the cmd window to UTF-8, and use a font that can render the Unicode characters.
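The first habit could look roughly like this on Java 6. This is a sketch of one possible reading of the advice, not the only way to do it: it wraps a target stream in a PrintStream with an explicit "UTF-8" encoding instead of printing through System.out directly. The names Utf8Out and utf8Stream are hypothetical.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {

    /** Wraps any OutputStream in an auto-flushing PrintStream that always encodes as UTF-8. */
    public static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The wrapper encodes to UTF-8 bytes itself; the wrapped stream
        // only passes those bytes through to the terminal.
        PrintStream out = utf8Stream(System.out);
        out.println("My Product Edition 2014\u00A9");
    }
}
```

The copyright sign will only display correctly if the terminal itself decodes UTF-8, which is the second item in the list.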
Ingo

If you want to remove special characters, you could do something like this:

String s = "My Product Edition 2014©";

s = s.replaceAll("[^\\w\\s]", "");

System.out.println(s);

Output:

My Product Edition 2014
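One caveat worth illustrating: by default, \w in Java matches only [a-zA-Z_0-9], so this pattern also removes ASCII punctuation and accented letters, not just symbols such as ©. The sketch below wraps the same regex in a hypothetical helper (WordCharDemo.stripSpecial) to show the effect on a few made-up inputs.

```java
public class WordCharDemo {

    /** Applies the regex from the answer: keep only word characters and whitespace. */
    public static String stripSpecial(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        // The copyright sign is removed, as intended.
        System.out.println(stripSpecial("My Product Edition 2014\u00A9"));
        // Ordinary punctuation is removed as well.
        System.out.println(stripSpecial("v2.0-beta"));
        // Non-ASCII letters are not word characters by default.
        System.out.println(stripSpecial("caf\u00E9"));
    }
}
```

If punctuation should survive, a range-based pattern over the printable ASCII characters is probably a better fit than \w.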
K139
  • Why do you think the OP wants to *remove* the ©? –  May 07 '15 at 20:59
  • I thought, he wanted to remove those special chars which will print as ? in the console. – K139 May 07 '15 at 21:02
  • Better would be to print it correctly. The approach in your answer alters the output. That's even worse than escaping unprintable characters by `?`. –  May 07 '15 at 21:03
  • @Tichodroma yes, it might be better to print it correctly, but K139's answering exactly what OP asked. – dimo414 May 07 '15 at 21:09
  • @dimo414 True, the answer matches the question. But, sadly, the question is an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Answering such a question like it is done here helps nobody. –  May 07 '15 at 21:15
  • @Tichodroma you're *assuming* it's an XY problem; it's possible OP has an entirely legitimate reason to want to strip these characters. Furthermore, others may come across this answer and have legitimate uses. It's good to point out there are often better alternatives, but there's no need to blame the answerer. – dimo414 May 07 '15 at 21:20

You can strip every character that is not a printable ASCII character using a regular expression and replaceAll():

// Removes every character outside the printable ASCII range (space through ~).
public static String asciiOnly(String unicodeString)
{
    return unicodeString.replaceAll("[^\\x20-\\x7E]", "");
}

Here is an explanation of the regular expression "[^\\x20-\\x7E]":

  • ^ - Negation: inside the brackets, it matches any character not listed.
  • \\x20 - Hex value of the space character, which is the first printable ASCII character.
  • - - Range operator, i.e. x20 through x7E.
  • \\x7E - Hex value of ~, which is the last printable ASCII character.
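Note that the range x20 to x7E excludes control characters, so tabs and line breaks are stripped along with non-ASCII characters. The sketch below shows that behavior and a possible variant that keeps common whitespace. The class name PrintableAscii and the method asciiKeepWhitespace are illustrative additions, not part of the original answer.

```java
public class PrintableAscii {

    /** Same regex as above: keeps only space through ~. */
    public static String asciiOnly(String unicodeString) {
        return unicodeString.replaceAll("[^\\x20-\\x7E]", "");
    }

    /** Variant (an assumption about what may be wanted): also keeps tab, LF and CR. */
    public static String asciiKeepWhitespace(String unicodeString) {
        return unicodeString.replaceAll("[^\\x20-\\x7E\\t\\n\\r]", "");
    }

    public static void main(String[] args) {
        String sample = "a\tb\u00A9\n";
        // The tab and the newline are removed together with the copyright sign.
        System.out.println("[" + asciiOnly(sample) + "]");
        // Only the copyright sign is removed.
        System.out.println("[" + asciiKeepWhitespace(sample) + "]");
    }
}
```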

padippist