Unescaping unicode literals found in Haskell Strings

Question

The Unicode code point for lowercase s is U+0073, which this website says is written \u0073 in C and Java.

Given a file: a.txt containing:

http://www.example.com/\u0073

Let's read this with Java, unescape it, and see what we get:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.lang3.StringEscapeUtils;

public class Main {
  public static void main(String[] args) throws IOException {
    String s2 = new String(Files.readAllBytes(Paths.get("a.txt")));
    System.out.println(s2); // prints http://www.example.com/\u0073

    String s3 = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(s2);
    System.out.println(s3); // prints http://www.example.com/s
  }
}

The output is:

$ java -cp ./commons-lang3-3.4.jar:. Main
http://www.example.com/\u0073
http://www.example.com/s

The unescapeJava(s2) method call takes the \\u0073 from the file and unescapes it to \u0073, which, when printed, becomes "s".

Can we do the same in Haskell?

Let's consume this file with the text library:

Prelude> a <- Data.Text.IO.readFile "a.txt"
Prelude> a
"http://www.example.com/\\u0073\n"

Haskell string literals use the \x prefix rather than \u for hexadecimal character escapes, which may confuse any expectation that \u0073 is translated to s automatically:

Prelude> "\x0073"
"s"

So how do I take the unescapeJava(..) method from Apache Commons Lang and replicate its functionality in Haskell, to go from \\u0073 to \u0073 and print this as "s"?

Can you show me an equivalent example of *reading a file* in C or Java and having `\u0073` converted to `s`? I would be incredibly surprised if any language – including C, Java, or Haskell – interpreted a "\" as an escape character *when reading a file*. — Rein Henrichs, Oct 21 '15 at 03:06
On the other hand, a literal "http://a.example/\x0073" *is* the same as "http://a.example/s" (using `OverloadedStrings`). The only difference is that Haskell uses "\x" as the prefix of a numeric escape for a hexadecimal character. — Rein Henrichs, Oct 21 '15 at 03:10
The string literal "s" is equal to "\x0073", not "\\u0073"; the latter is simply the characters `\ u 0 0 7 3`. If you want this functionality, you need to implement it yourself, although it looks like you want to parse some sort of markup language, for which there are a multitude of libraries. — user2407038, Oct 21 '15 at 03:11
@user2407038 so the W3C unit tests are all written for languages that encode the unicode character `s` as \u0073 (according to http://www.fileformat.info/info/unicode/char/0073/index.htm ). That'd be quite alarming, given that the W3C unit tests are exactly for the purpose of bringing multiple parsers into line. — Rob Stewart, Oct 21 '15 at 03:16
I don't see what relevance the W3C unit tests have to how numeric character escapes work in Haskell, or why you expect `readFile` to conform to them. For that matter, you haven't shown that reading a file in C or Java conforms to them either. — Rein Henrichs, Oct 21 '15 at 03:21
`Data.Text.IO.readFile` doesn't do any parsing, it simply attempts to read a file as a unicode encoded text file, and returns the contents. I can't imagine why anyone would think this function would perform the logic of parsing a markup language. — user2407038, Oct 21 '15 at 03:23
Note that `unescapeJava` doesn't translate `\\u0073` to `\u0073`, so that it displays as `s`. It's not "removing a backslash" or anything like that. The 6 character sequence `\u0073` is one way of *representing* the character `s` **in Java source code**. Whereas `\\u0073` is a 7 character sequence for representing the 6 character sequence `\u0073` **in Java source code**. So `unescapeJava` translates the 6 characters `\u0073` to `s`, and then there's no need for any special handling to display it as `s`. — Ben, Feb 10 '16 at 23:45

Rein Henrichs · Answer 1 · 2015-10-21T03:43:32.300

In your example, a and b are not equal because the contents of the files that produced them are not equal.

readFile reads the literal contents of a file using "the runtime system's locale, character set encoding, and line ending conversion settings." readFile will not parse numeric or other character escapes in W3C-compatible (or any other) form. The character "\" in a file will always be read as a literal "\", and never as the beginning of an escape sequence. I'm not sure why you expect this to behave otherwise, as I don't know of any language whose standard library automatically attempts to parse literal "\"s into escape sequences when reading the contents of files.
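
To make that concrete, here is a minimal sketch. It assumes it is allowed to create (or overwrite) a scratch file named a.txt in the current directory, and it uses the Prelude's String-based readFile rather than Data.Text.IO so that no extra package is needed. The name rawContents is made up for this illustration. The doubled backslash that GHCi displays comes from show, not from the file.

```haskell
-- Sketch: readFile hands back the file's characters unchanged.
-- Assumption: we may create/overwrite "a.txt" in the current directory.

-- The file contents from the question, written as a Haskell literal.
-- "\\" in source code denotes ONE backslash character at runtime.
rawContents :: String
rawContents = "http://www.example.com/\\u0073\n"

main :: IO ()
main = do
  writeFile "a.txt" rawContents
  a <- readFile "a.txt"
  putStr a                 -- the raw text, with a single backslash
  print a                  -- the 'show' form: the backslash is escaped, as GHCi displays it
  print (a == rawContents) -- True: no escape processing happened while reading
  print (length "\\u0073", length "\x0073") -- (6,1): six literal characters vs. the single character 's'
```

The last line is the distinction the comments above are making: "\\u0073" in Haskell source is six ordinary characters, whereas "\x0073" is a one-character literal.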

If you want to parse the literal text "\u0073" (that is, the characters \, u, 0, 0, 7, 3, which Haskell displays as "\\u0073") as a numeric escape for the character s, you will need to write a parser or use one that someone else has written. readLitChar is such a parser, but it uses the Haskell convention, which is different from what W3C defines. However, you can see the underlying construction of lexCharE, which may help you write your own.
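
As a rough starting point, a hand-written unescaper might look like the sketch below. It is not a port of unescapeJava. The name unescapeUnicode is hypothetical, and it assumes the only escape of interest is a backslash, a lowercase u, and exactly four hexadecimal digits. It does not handle \n, \t, an escaped backslash, or any of the other forms a Java-style unescaper accepts, and it does not combine surrogate pairs.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- | Replace each backslash-u escape that is followed by exactly four
-- hexadecimal digits with the character it names. Everything else,
-- including malformed escapes, is copied through unchanged.
-- (Hypothetical helper for illustration, not a library function.)
unescapeUnicode :: String -> String
unescapeUnicode ('\\' : 'u' : rest)
  | (hex, rest') <- splitAt 4 rest
  , length hex == 4
  , all isHexDigit hex
  = chr (foldl (\n d -> n * 16 + digitToInt d) 0 hex) : unescapeUnicode rest'
unescapeUnicode (c : cs) = c : unescapeUnicode cs
unescapeUnicode []       = []

main :: IO ()
main = putStrLn (unescapeUnicode "http://www.example.com/\\u0073")
-- prints http://www.example.com/s
```

For Text input, one option under the same assumptions is to wrap it as T.pack . unescapeUnicode . T.unpack. A parser-combinator library would be a more robust choice if you need the full set of escapes.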

Even your edit confuses *reading a file* with a Text or String *literal*. — Rein Henrichs, Oct 21 '15 at 04:04
Thanks again. I've since tried to separate in the question the distinction between what is consumed from the file and what I'd like to do with the consumed string. I've since given a Java example in the question, which demonstrates what I'd like to achieve in Haskell. That is, an equivalent to the `unescapeJava(..)` method from the apache commons lang Java package. — Rob Stewart, Oct 21 '15 at 15:11
