
I have a file with garbled Japanese text and need to convert it back to readable Japanese. The problem is that a) I don't know which encoding the original text used, and b) I don't know much about encodings and decodings and how to even go about converting one to the other.

If I run less on the file, it shows

ã<U+0081>“ã‚“ã<U+0081>«ã<U+0081>¡ã<U+0081>¯

If I open it in a text editor I see

ã“ã‚“ã«ã¡ã¯

I'm on a Mac and know there's a command called iconv, but so far all my attempts to decode the file have failed.

How can I convert that back to readable Japanese?

Alex Ixeras
    If garbled, it might not be possible. Text files are a sequence of bytes that represent integers called code units that are produced by a character encoding from codepoints in a character set. The fundamental rule is to read with the encoding the text was written with. To do that, you obviously need metadata, which is probably not stored with the bytes in the file. Any program that you don't tell which encoding to use is just going to guess. Please [edit] to show the [bytes](http://charlespatricknewman.com/blog/mac-os-x-easy-way-to-do-a-hex-dump-of-a-file/) from the file. EUC-JP → 釃釩"祀磧祚 – Tom Blodget Nov 28 '17 at 04:23
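For anyone who wants to follow the comment's suggestion and look at the raw bytes, here is a minimal sketch in Python. The function name `hex_dump` and the 64-byte limit are illustrative choices for this example, not anything from the original post.

```python
from pathlib import Path

def hex_dump(path, limit=64):
    """Return the first `limit` bytes of the file as space-separated hex."""
    data = Path(path).read_bytes()[:limit]
    return " ".join(f"{b:02x}" for b in data)

# Usage (the file name is a placeholder):
#   print(hex_dump("garbled.txt"))
```

Reading the file in binary mode avoids any guessing about the encoding, which is the point of the comment above.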

2 Answers


The text you pasted appears to be UTF-8 bytes that were displayed as if they were CP1252. In other words, the underlying text is UTF-8.

On many Linux systems, you can execute 'man cp1252' to get the codepoints defined in CP1252. Here are the characters I'm seeing in your pasted text:

   Oct   Dec   Hex   Char  Description
   343   227   E3     ã     LATIN SMALL LETTER A WITH TILDE
   202   130   82     ‚     SINGLE LOW-9 QUOTATION MARK
   223   147   93     “     LEFT DOUBLE QUOTATION MARK
   253   171   AB     «     LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
   241   161   A1     ¡     INVERTED EXCLAMATION MARK
   257   175   AF     ¯     MACRON
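If you don't have that man page available, the same rows can be reproduced with a short Python sketch. The helper name `cp1252_row` is made up for this example; it relies only on Python's built-in cp1252 codec and the standard unicodedata module.

```python
import unicodedata

def cp1252_row(ch):
    """Return (octal, decimal, hex, Unicode name) for one CP1252 character."""
    byte = ch.encode("cp1252")[0]  # the single byte CP1252 assigns to ch
    return (f"{byte:o}", byte, f"{byte:02X}", unicodedata.name(ch))

# The six characters from the table, written as escapes to stay ASCII-safe.
for ch in "\u00e3\u201a\u201c\u00ab\u00a1\u00af":
    octal, dec, hexa, name = cp1252_row(ch)
    print(f"{octal:>6}{dec:>6}{hexa:>5}     {name}")
```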

The text you pasted:

ã<U+0081>“ã‚“ã<U+0081>«ã<U+0081>¡ã<U+0081>¯

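Mapping each character back to its CP1252 byte (the <U+0081> placeholder stands for byte 0x81, which CP1252 leaves undefined) gives: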

\xE3\x81\x93 \xE3\x82\x93 \xE3\x81\xAB \xE3\x81\xA1 \xE3\x81\xAF

We can ask, for example, perl to print these bytes:

perl -e 'print "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF"'
こんにちは
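The same reversal can be done on the garbled string itself rather than on hand-typed bytes. Below is a minimal Python sketch, assuming the text really is UTF-8 that was misread as CP1252. The function name `fix_mojibake` is hypothetical, and the fallback for characters CP1252 cannot encode (such as U+0081) is an assumption about how the garbling happened.

```python
def fix_mojibake(garbled):
    """Undo UTF-8 text that was mistakenly decoded as CP1252."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # CP1252 leaves a few bytes (e.g. 0x81) undefined; assume such a
            # character was passed through unchanged as the same byte value.
            if ord(ch) < 0x100:
                raw.append(ord(ch))
            else:
                raise
    return raw.decode("utf-8")

# The garbled text from the question, written with escapes.
garbled = ("\u00e3\u0081\u201c\u00e3\u201a\u201c\u00e3\u0081\u00ab"
           "\u00e3\u0081\u00a1\u00e3\u0081\u00af")
fixed = fix_mojibake(garbled)  # the Japanese greeting from the perl example
```

If the bytes do not form valid UTF-8, the final decode raises UnicodeDecodeError, which is a useful sign that the CP1252 assumption was wrong.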
sneep

On a Mac there are a number of text editors that let you reopen a garbled document in a different encoding so that it displays correctly.

You can for example use BBEdit (demo-mode/lite version) to "Reopen using encoding..." and select the encoding that will properly display the file.
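The same trial-and-error can be scripted. This is only a rough sketch: the candidate list is a guess at common Japanese encodings, and `try_encodings` is a hypothetical helper, not part of any editor or library.

```python
# Codec names as Python spells them; adjust the list to taste.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]

def try_encodings(data, candidates=CANDIDATES):
    """Decode `data` with each candidate; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Usage (the file name is a placeholder):
#   with open("garbled.txt", "rb") as f:
#       for name, text in try_encodings(f.read()).items():
#           print(name, "->", text)
```

A decode that succeeds is not necessarily the right one, so you still have to eyeball the results, just as you would in the editor.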

brandelune