Background

I need to extract a URL string from HTML (it appears to be embedded in JSON), so I tried org.apache.commons.text.StringEscapeUtils.unescapeJson.

An example of such a URL starts like this in the input:

https:\/\/scontent.cdninstagram.com\/v\/t51.2885-19\/40405422_462181764265305_1222152915674726400_n.jpg?stp=dst-jpg_s150x150\\u0026

The problem

It seems it had some characters that weren't handled so if I perform this:

val test="https:\\/\\/scontent.cdninstagram.com\\/v\\/t51.2885-19\\/40405422_462181764265305_1222152915674726400_n.jpg?stp=dst-jpg_s150x150\\\\u0026\n"
Log.d("AppLog", "${StringEscapeUtils.unescapeJson(test)}")

the result is:

https://scontent.cdninstagram.com/v/t51.2885-19/40405422_462181764265305_1222152915674726400_n.jpg?stp=dst-jpg_s150x150\u0026

You can see that "\u0026" is still in it. I found that this solves it:

StringEscapeUtils.unescapeJson(input).replace("\\u0026","&").replace("\\/", "/") 

This works, but I think I should use something more official, as it might fail due to too-direct replacing of substrings.

What I've tried

Looking at the unescapeJson code (which seems to be the same for Java and JSON), I thought that maybe I could just add the rules:

/** Based on StringEscapeUtils.unescapeJson, with 2 more rules added. */
fun unescapeUrl(input: String): String {
    val unescapeJavaMap = hashMapOf<CharSequence, CharSequence>(
        "\\\\" to "\\",
        "\\\"" to "\"",
        "\\'" to "'",
        "\\" to StringUtils.EMPTY,
        //added rules:
        "\\u0026" to "&",
        "\\/" to "/"
    )
    val aggregateTranslator = AggregateTranslator(
        OctalUnescaper(),
        UnicodeUnescaper(),
        LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_UNESCAPE),
        LookupTranslator(Collections.unmodifiableMap(unescapeJavaMap))
    )
    return aggregateTranslator.translate(input)
}

This doesn't work. It leaves the string with "\u0026" in it.

The questions

  1. What did I do wrong here? How can I fix this?

  2. Is it true that it's best to use something similar to the original code instead of using "replace"?

BTW, I use this on Android with Kotlin, but the same can be done in Java on a PC.

android developer

1 Answer

Let me just give you my working example using StringEscapeUtils.unescapeJson(input) without replace. I've also looked into the StringEscapeUtils source code, which might help you a bit.

Here is my working Kotlin code (Java works the same in my test).

fun main(args: Array<String>) {
    val input = "Hello ampersand \\u0026 and forward slash \\/"
    println(input)

    val output1 = StringEscapeUtils.unescapeJson(input)
    println(output1)

    val output2 = StringEscapeUtils.unescapeJson(input).replace("\\u0026", "&").replace("\\/", "/")
    println(output2)
}

Output:

Hello ampersand \u0026 and forward slash \/ - original input
Hello ampersand & and forward slash /       - StringEscapeUtils.unescapeJson(input)
Hello ampersand & and forward slash /       - StringEscapeUtils.unescapeJson(input).replace...

As you can see, the outputs are identical regardless of using the replace logic. I'm using org.apache.commons:commons-text:1.10.0.

If we look into their source code, it becomes clear that we don't have to add any replace("\\u0026", "&").replace("\\/", "/") because:

  • the escaped-Unicode representation of the ampersand is handled by UnicodeUnescaper. You can see it used in your unescapeUrl, which was replicated from the UNESCAPE_JAVA implementation.
  • the \/ sequence (written "\\/" in source code) is handled by another existing rule in UNESCAPE_JAVA, unescapeJavaMap.put("\\", StringUtils.EMPTY), which drops a lone backslash and is also replicated in your unescapeUrl.

So, answering your questions (NB: also see the UPDATE below, which takes into account the "broken" input that the author posted later):

  1. It's not obvious what is wrong in your example using just StringEscapeUtils.unescapeJson(input). As you can see, it works in my Kotlin example (and in Java as well). Maybe it's the version of the "commons-text" library, but I doubt that. I'm also on a PC, not Android. See the UPDATE below, which explains the "broken" input the author posted later and how to deal with it.
  2. That is true, I totally agree. And in this particular example you don't even need "something similar": you should be fine using the out-of-the-box method, with no need to customise it in any way.

I hope this answer helps. Also, as was mentioned in the comments, a good example from you would be very helpful!

UPDATE: Looking at the author's example (posted later), I can see that the escaped-Unicode representation of the ampersand is effectively double-escaped in the input: \\u0026 instead of \u0026. That is the problem. If you look at the source code of UNESCAPE_JAVA (UNESCAPE_JSON), you will see that the \\ sequence gets transformed into a single backslash \ by the rule unescapeJavaMap.put("\\\\", "\\"). The translator loop then advances the index by 2, because two characters were consumed, which places it at the u character. The backslash that UnicodeUnescaper would need to see has already been consumed, so u0026 is copied through unchanged and the output contains \u0026.
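
The index behaviour described above can be illustrated with a toy single-pass unescaper. This is only a minimal sketch in plain Kotlin, not the commons-text implementation: the name unescapeOnce is made up for illustration, and it knows just three rules (\uXXXX, \\ and \/).

```kotlin
// Toy single-pass unescaper (hypothetical, NOT the commons-text code).
// At each index it applies at most one rule, then skips what that rule consumed.
fun unescapeOnce(input: String): String {
    val hexDigits = "0123456789abcdefABCDEF"
    val out = StringBuilder()
    var i = 0
    while (i < input.length) {
        val c = input[i]
        if (c == '\\' && i + 1 < input.length) {
            val next = input[i + 1]
            when {
                // \uXXXX -> the character with that code (BMP only in this sketch)
                next == 'u' && i + 6 <= input.length &&
                    input.substring(i + 2, i + 6).all { it in hexDigits } -> {
                    out.append(input.substring(i + 2, i + 6).toInt(16).toChar())
                    i += 6
                }
                // \\ -> \ and the index moves past BOTH backslashes,
                // so a following "u0026" is no longer preceded by a backslash
                next == '\\' -> { out.append('\\'); i += 2 }
                // \/ -> /
                next == '/' -> { out.append('/'); i += 2 }
                // anything else: copy the backslash through unchanged
                else -> { out.append(c); i += 1 }
            }
        } else {
            out.append(c)
            i += 1
        }
    }
    return out.toString()
}

fun main() {
    val single = "s150x150\\u0026"     // runtime text: s150x150\u0026
    val double = "s150x150\\\\u0026"   // runtime text: s150x150\\u0026
    println(unescapeOnce(single))               // s150x150&
    println(unescapeOnce(double))               // s150x150\u0026
    println(unescapeOnce(unescapeOnce(double))) // s150x150&
}
```

The second line of output mirrors what the author sees: the double-escaped form needs two passes before the ampersand appears.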

I would say this is an upstream problem: whatever produces the input sends you a badly formatted string. Ideally, it should be fixed there so that characters in escaped-Unicode format are not double-escaped. Then \\u0026 would arrive as \u0026.

You can also compose your own AggregateTranslator so that it handles this scenario properly. There are probably a few options, but they could all be error-prone and stop working properly in other scenarios, so you have to be careful with that.

You can also run the unescapeJson method twice, as in StringEscapeUtils.unescapeJson(StringEscapeUtils.unescapeJson(input)), which works for your particular example. But obviously, you could easily over-unescape the input that way.
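
A narrower option than a blanket second pass might be to repair only the double-escaped Unicode sequences before unescaping once. The following is a sketch under assumptions, not something taken from commons-text: the helper name normalizeDoubleEscapedUnicode and its regex are hypothetical, and it assumes a doubled backslash directly before "u" plus four hex digits is always a mistake in your input.

```kotlin
// Hypothetical helper: collapses a doubled backslash only when it is directly
// followed by "u" and four hex digits, and leaves every other backslash alone.
val doubleEscapedUnicode = Regex("""\\\\(u[0-9a-fA-F]{4})""")

fun normalizeDoubleEscapedUnicode(input: String): String =
    // The lambda result is inserted literally, so the single backslash needs no extra escaping.
    doubleEscapedUnicode.replace(input) { match -> "\\" + match.groupValues[1] }

fun main() {
    val broken = "dst-jpg_s150x150\\\\u0026"   // runtime text: dst-jpg_s150x150\\u0026
    println(normalizeDoubleEscapedUnicode(broken)) // dst-jpg_s150x150\u0026
    // The normalized text could then be passed to StringEscapeUtils.unescapeJson once.
}
```

Like the other options, this makes an assumption about the input: a string that legitimately contains an escaped backslash followed by the text u0026 would be changed as well.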

Dmitry Khamitov
  • You want me to remove the "replace", but as I wrote, this means the characters stayed, and so if I try to open a URL (such as a URL to an image) it would fail. I can't let those characters stay in the URL. I've added an example of its beginning inside my post. – android developer May 14 '23 at 22:32
  • @androiddeveloper you can check out my detailed update in the answer. – Dmitry Khamitov May 15 '23 at 11:34
  • As I wrote, it doesn't work without the "replace". I had to add it. I've updated my question again, showing what happens when I don't use "replace". – android developer May 15 '23 at 11:59
  • @androiddeveloper right, like I said, you can check out my answer and all the links I've posted. Then, you can easily use either of the suggested options that perfectly work on your example. But like I said, you have to be careful with that and all of them make some assumptions about the input. So does your "replace" approach. Ideally, the input must not be that "broken" as I've explained in my answer. – Dmitry Khamitov May 15 '23 at 20:47
  • I'm not responsible for the input, and I don't understand what you are offering. Sorry – android developer May 16 '23 at 21:43