2

I am getting the following encoded HTML in a JSON response and have no idea how to decode it into a normal HTML string (it is an anchor tag, by the way).

x3ca hrefx3dx22http:\/\/wordnetweb.princeton.edu\/perl\/webwn?sx3dstrandx22x3ehttp:\/\/wordnetweb.princeton.edu\/perl\/webwn?sx3dstrandx3c\/ax3e

I have tried java.net.URLDecoder.decode without any luck.

Waqas
  • That's not JSON at all. Where is this data coming from that is claiming it is JSON? – Tyler Sep 23 '10 at 06:05
  • here is the actual JSON response [{"type":"text","text":"Resentment - B\x27Day is the second studio album by American R\x26B singer Beyoncé Knowles, released September 4, 2006, on Columbia Records in collaboration with Music World Music and Sony Urban Music. Its release coincided with Knowles\x27 twenty-fifth birthday. ...","language":"en"},{"type":"url","text":"\x3ca href\x3d\x22http://en.wikipedia.org/wiki/Resentment_(song)\x22\x3ehttp://en.wikipedia.org/wiki/Resentment_(song)\x3c/a\x3e","language":"en"}] – Waqas Sep 23 '10 at 06:13

4 Answers

7

The term to search for is "UTF-8 code units". Each of these code units is written as a backslash, followed by an "x" and a two-digit hexadecimal ASCII code. I wrote a little converter method for you:

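import java.nio.charset.StandardCharsets;

public static String convertUTF8Units(String input) {
    String output = input;
    // Slide a 4-character window over the input, looking for \xHH escapes.
    for (int i = 0; i <= input.length() - 4; i++) {
        String part = input.substring(i, i + 4);
        // Only accept a backslash, an 'x', and two hex digits, so parseInt cannot throw.
        if (part.matches("\\\\x[0-9a-fA-F]{2}")) {
            // Decode the two hex digits into a single byte value (0x00 to 0xFF).
            byte[] rawByte = { (byte) Integer.parseInt(part.substring(2), 16) };
            // ISO-8859-1 maps each byte to the same code point, independent of the platform charset.
            String raw = new String(rawByte, StandardCharsets.ISO_8859_1);
            // Replaces every occurrence of this escape in the output, not only the one at i.
            output = output.replace(part, raw);
        }
    }

    return output;
}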

I know, it's a bit frowzy, but it works :)

Keenora Fluffball
  • thanks Keenora, but I already did it using regular expression – Waqas Sep 27 '10 at 08:41
  • I needed it for PowerShell and I could not get it converted in a fast way, then I found a way simpler method here: https://stackoverflow.com/a/49344121/2964949 – Patrick Oct 31 '18 at 10:48
1

That's not an encoding I've seen before, but it looks like xYZ (where Y and Z are hex digits [0-9a-f]) means "the character whose ASCII code is 0xYZ". I'm not sure how the letter x itself would be encoded, so I would recommend trying to find out. But then you can just do a find-and-replace on the regex x([0-9a-f]{2}): parse the two hex digits as an integer, then cast it to a char (or something similar to that).

Then also, it looks like slashes (and other characters? See if you can find out...) always have a backslash in front of them, so do another find-and-replace for that.
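
The two find-and-replace passes described above could be sketched in Java roughly as follows. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the escapes carry a leading backslash (as in the raw JSON quoted in the question's comments) and that `\/` is the only other escape that needs undoing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the class and method names are illustrative only. */
final class HexEscapeDecoder {
    // Matches a backslash, an 'x', and exactly two hex digits, e.g. \x3c.
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9a-fA-F]{2})");

    static String decode(String input) {
        Matcher m = HEX_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // First pass: turn the two hex digits into the character with that code.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '\' or '$' from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        // Second pass: escaped slashes (\/) become plain slashes.
        return out.toString().replace("\\/", "/");
    }

    public static void main(String[] args) {
        String sample = "\\x3ca href\\x3d\\x22http:\\/\\/example.org\\x22\\x3elink\\x3c\\/a\\x3e";
        System.out.println(decode(sample)); // prints <a href="http://example.org">link</a>
    }
}
```

Because the input is scanned once from left to right, a decoded character is never re-examined as the start of another `\x` escape.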

Tyler
  • You should also try to figure out how unicode characters above `ff` would be represented, and be sure to modify your approach accordingly. – Tyler Sep 23 '10 at 06:07
  • I faced the same problem retrieving Arabic JSON data from this link: https://www.facebook.com/feeds/page.php?id=103622369714881&format=json Can you please tell me what you did? – eng.ahmed Sep 05 '13 at 03:34
1

Thanks!!

Take care: in the for loop the operator must be "<=", otherwise an escape at the very end of the string can't be decoded.

for(int i=0;i<=input.length()-4;i++) {..}
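
To see why the bound matters, here is a small sketch that counts the 4-character windows beginning with `\x` under either comparison. The class name, method name, and the `inclusive` flag are made up for this illustration and are not part of the original converter.

```java
/** Hypothetical demo; names are illustrative only. */
final class LoopBoundDemo {
    // Counts positions where a 4-character window starting with \x fits,
    // using "<=" when inclusive is true and "<" otherwise.
    static int countEscapes(String input, boolean inclusive) {
        int last = input.length() - 4;
        int count = 0;
        for (int i = 0; inclusive ? i <= last : i < last; i++) {
            if (input.startsWith("\\x", i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "a\\x3e"; // the escape occupies the last four characters
        System.out.println(countEscapes(sample, true));  // prints 1
        System.out.println(countEscapes(sample, false)); // prints 0
    }
}
```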

Cheers!

-2

This works for me

    public static String convertUTF8Units_version2(String input) throws UnsupportedEncodingException
    {
         return URLDecoder.decode(input.replaceAll("\\\\x", "%"),"UTF-8");
    }
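
One thing to watch with this approach, shown in the sketch below: URLDecoder also applies form-decoding rules, so a literal "+" in the text comes back as a space, and a literal "%" in the input would likewise be treated as the start of an escape. The class and method names here are hypothetical, and the checked exception is wrapped only to keep the example short.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical demo; names are illustrative only. */
final class UrlDecoderCaveat {
    // Same idea as above: rewrite \xHH to %HH and let URLDecoder do the decoding.
    static String viaUrlDecoder(String input) {
        try {
            return URLDecoder.decode(input.replaceAll("\\\\x", "%"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaUrlDecoder("R\\x26B"));    // prints R&B
        System.out.println(viaUrlDecoder("1+1\\x3d2"));  // prints 1 1=2 (the '+' became a space)
    }
}
```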
jimbo