8

I have a string "\\u003c", which belongs to UTF-8 charset. I am unable to decode it to Unicode because of the double backslashes. How do I get "\u003c" from "\\u003c"? I am using Java.

I tried with,

myString.replace("\\\\", "\\");

but could not achieve what I wanted.

This is my code,

String myString = FileUtils.readFileToString(file);
String a = myString.replace("\\\\", "\\");
byte[] utf8 = a.getBytes();

// Convert from UTF-8 to Unicode
a = new String(utf8, "UTF-8");
System.out.println("Converted string is:"+a);

and content of the file is

\u003c

Vinay thallam

7 Answers

11

You can use String#replaceAll:

String str = "\\\\u003c";
str= str.replaceAll("\\\\\\\\", "\\\\");
System.out.println(str);

It looks weird because the first argument is a string defining a regular expression, and \ is a special character both in string literals and in regular expressions. To actually put a \ in our search string, we need to escape it (\\) in the literal. But to actually put a \ in the regular expression, we have to escape it at the regular expression level as well. So to literally get \\ in a string, we need to write \\\\ in the string literal; and to get two literal \\ to the regular expression engine, we need to escape those as well, so we end up with \\\\\\\\. That is:

String Literal        String                      Meaning to Regex
−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−−−−−−−−
\                     Escape the next character   Would depend on next char
\\                    \                           Escape the next character
\\\\                  \\                          Literal \
\\\\\\\\              \\\\                        Literal \\

In the replacement parameter, even though it's not a regex, it still treats \ and $ specially — and so we have to escape them in the replacement as well. So to get one backslash in the replacement, we need four in that string literal.
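
To make the escaping layers above concrete, here is a minimal sketch (the class and method names are illustrative, not part of the answer). It compares the hand-escaped call with a variant that lets Pattern.quote and Matcher.quoteReplacement add the regex-level escaping, so only the string-literal escaping is written by hand:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeLayers {
    // Hand-escaped: eight backslashes in the literal reach the regex engine
    // as four, which it reads as two literal backslashes.
    static String collapseManually(String s) {
        return s.replaceAll("\\\\\\\\", "\\\\");
    }

    // Library-escaped: quote() and quoteReplacement() supply the regex-level
    // escaping, so the literals only carry string-literal escaping.
    static String collapseQuoted(String s) {
        return s.replaceAll(Pattern.quote("\\\\"), Matcher.quoteReplacement("\\"));
    }

    public static void main(String[] args) {
        String input = "\\\\u003c"; // runtime content: two backslashes, then u003c

        System.out.println(input.length());                    // 7
        System.out.println(collapseManually(input).length());  // 6
        System.out.println(collapseManually(input)
                .equals(collapseQuoted(input)));                // true
    }
}
```

Both variants should produce the same six-character result: one backslash followed by u003c.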

T.J. Crowder
mtyson
  • 1
    The only short and correct answer in this thread :) Yep both the 1st and 2nd arg have to have the \ quadrupled as actually both strings are special regex-ey strings, not regular strings. – jakub.g Jul 04 '17 at 14:37
  • 1
    @jakub.g: You should post a `replace` answer. As you said on my now-deleted answer, `replaceAll` is just the wrong tool if your goal is to replace ``\\`` with ``\``. – T.J. Crowder Jul 05 '17 at 08:19
  • 1
    @T.J.Crowder it took me a while, but I finally posted a `replace` answer! – jakub.g Apr 26 '18 at 15:57
  • 1
    This is my best answer I didn't really write! Credit goes to @T.J.Crowder! – mtyson Oct 25 '18 at 20:11
  • 1
    @mtyson - Just fleshed it out a bit. ;-) I love it when the collaborative aspect of SO works. – T.J. Crowder Oct 25 '18 at 21:28
6

Not sure if you're still looking for a solution to your problem (since you have an accepted answer) but I will still add my answer as a possible solution to the stated problem:

String str = "\\u003c";
Matcher m = Pattern.compile("(?i)\\\\u([\\da-f]{4})").matcher(str);
if (m.find()) {
    String a = String.valueOf((char) Integer.parseInt(m.group(1), 16));
    System.out.printf("Unicode String is: [%s]%n", a);
}

OUTPUT:

Unicode String is: [<]

Here is online demo of the above code
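
If the input can contain more than one escape, the same idea can be extended to decode all of them. The following is only a sketch under that assumption; the class name UnicodeEscapeDecoder and the method decode are hypothetical, and it does not treat an escaped backslash in front of the u specially:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeDecoder {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every matched escape sequence with the character it encodes.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign from
            // being interpreted as an escape or a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a \\u003c b \\u003e c")); // a < b > c
    }
}
```

StringBuffer is used here because Matcher.appendReplacement only accepts a StringBuilder from Java 9 onward.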

anubhava
  • 1
    Magically replaces "\\" with "\". Thank you – Vinay thallam Jun 14 '12 at 05:57
  • 1
    The question remains: why are the double backslashes there in the String in the first place? – user207421 Jun 14 '12 at 06:43
  • @EJP Hi. I am sure double backslashes are there in myString. When I assign "\u003c" to myString in my source code and print it to the console right after, it gives "<". But if I read the same "\u003c" from a file, assign it to myString, and print it to the console, it prints \u003c. My guess is that the FileUtils API is escaping the backslash when reading the file. – Vinay thallam Jun 14 '12 at 10:53
4

Regarding the problem of "replacing double backslashes with single backslashes" or, more generally, "replacing a simple string, containing \, with a different simple string, containing \" (which is not entirely the OP's problem, but part of it):

Most of the answers in this thread mention replaceAll, which is the wrong tool for the job here. The easier tool is replace. Confusingly, the OP states that replace("\\\\", "\\") doesn't work for him, which is perhaps why all the answers focus on replaceAll.

Important note for people with a JavaScript background: replace(CharSequence, CharSequence) in Java replaces ALL occurrences of a substring, unlike in JavaScript, where it only replaces the first one!

Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.

On the other hand, replaceAll(String regex, String replacement) treats both parameters as more than plain strings:

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string.

(This is because $ is used for backreferences to captured regex groups and \ is used to escape characters in the replacement; hence, if you want to use them literally, you need to escape them.)
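
As a small illustration of that difference (a sketch with made-up names and inputs, not taken from the docs), the same "$1" text acts as a group reference in replaceAll but is inserted verbatim by replace:

```java
public class ReplacementSpecials {
    // replaceAll: "$1" and "$2" in the replacement refer to capture groups.
    static String swapWithRegex(String s) {
        return s.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    // replace: both arguments are literal text, so "$1" is inserted as-is.
    static String literalReplace(String s) {
        return s.replace("key", "$1");
    }

    public static void main(String[] args) {
        System.out.println(swapWithRegex("key=value"));  // value=key
        System.out.println(literalReplace("key=value")); // $1=value
    }
}
```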

In other words, both the first and the second parameter behave differently in replace and replaceAll. For replace you need to double the \ in both params (standard escaping of a backslash in a string literal), whereas in replaceAll, you need to quadruple it! (standard string escape + function-specific escape)

To sum up, for simple replacements, one should stick to replace("\\\\", "\\") (it needs only one escaping, not two).

https://ideone.com/ANeMpw

System.out.println("a\\\\b\\\\c");                                 // "a\\b\\c"
System.out.println("a\\\\b\\\\c".replaceAll("\\\\\\\\", "\\\\"));  // "a\b\c"
//System.out.println("a\\\\b\\\\c".replaceAll("\\\\\\\\", "\\"));  // runtime error
System.out.println("a\\\\b\\\\c".replace("\\\\", "\\"));           // "a\b\c"

https://www.ideone.com/Fj4RCO

String str = "\\\\u003c";
System.out.println(str);                                // "\\u003c"
System.out.println(str.replaceAll("\\\\\\\\", "\\\\")); // "\u003c"
System.out.println(str.replace("\\\\", "\\"));          // "\u003c"
jakub.g
3

Another option: capture one of the two backslashes and replace both backslashes with the captured group:

public static void main(String args[])
{
    String str = "C:\\\\";
    str= str.replaceAll("(\\\\)\\\\", "$1");

    System.out.println(str);
} 
podnov
1

Try using,

myString.replaceAll("[\\\\]{2}", "\\\\");
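
One thing to keep in mind with the call above: Java strings are immutable, so the result has to be assigned to something. A minimal sketch (class and method names are illustrative):

```java
public class CharacterClassReplace {
    // At the regex level the pattern is a character class holding a single
    // backslash, repeated exactly twice; the replacement is one backslash.
    static String collapse(String s) {
        return s.replaceAll("[\\\\]{2}", "\\\\");
    }

    public static void main(String[] args) {
        String myString = "\\\\u003c"; // runtime content: two backslashes, then u003c

        collapse(myString);                 // result discarded, myString is unchanged
        String fixed = collapse(myString);  // keep the returned string instead

        System.out.println(myString.length()); // 7
        System.out.println(fixed.length());    // 6
    }
}
```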

Jaykishan
0

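This replaces a double backslash with a single backslash:

public static void main(String args[])
{
      String str = "\\\\u003c";                  // runtime content: two backslashes, then u003c
      str = str.replaceAll("\\\\\\\\", "\\\\"); // regex matches two backslashes; replacement is one

      System.out.println(str);
}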
  • 3
    Has something changed in java 7? This code no longer works. `Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 1` – Wojtek Jun 17 '14 at 12:07
0

"\\u003c" does not 'belong to UTF-8 charset' at all. It is six characters: '\', 'u', '0', '0', '3', and 'c'. The real question here is why the double backslashes are there at all. Or are they really there? Is your problem perhaps something completely different? If the String "\\u003c" is in your source code, there are no double backslashes in it at all at runtime, and whatever your problem may be, it doesn't concern decoding in the presence of double backslashes.

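The point about source literals versus runtime content can be checked directly. This is a small sketch (the class and helper names are made up for illustration) that counts the backslashes actually present at runtime:

```java
public class LiteralVersusRuntime {
    // Counts the backslash characters actually present in the string at runtime.
    static int countBackslashes(String s) {
        int count = 0;
        for (char c : s.toCharArray()) {
            if (c == '\\') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String escapedLiteral = "\\u003c"; // source shows two backslashes; runtime holds one
        String unicodeLiteral = "\u003c";  // the compiler turns this escape into '<'

        System.out.println(escapedLiteral.length());          // 6
        System.out.println(countBackslashes(escapedLiteral)); // 1
        System.out.println(unicodeLiteral);                   // <
    }
}
```

Text read from a file that contains a backslash followed by u003c is the same six characters as escapedLiteral; nothing has been doubled, it simply has not been decoded.
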
user207421