Mac issue with file encoding

Question

I have a script that reads data from a server and stores it in a file, but the file seems to be corrupt somehow. I can print it to the terminal, but checking it with `file` produces:

bash$ file -I filename  
filename: text/plain; charset=unknown-8bit

Why is it telling me that the encoding is unknown? The first line of the file displays for me as

“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody

A hex dump reveals that the first three bytes are 0xE2, 0x80, 0x9C followed by the regular ASCII text The Galaxy A5...

What's wrong? Why does `file` tell me the encoding is unknown, and what is it actually?

Without access to the file *or* some sort of indication what's in it and how it's broken; no, we cannot. Maybe check out the [`character-encoding` tag wiki](http://stackoverflow.com/tags/character-encoding/info) for some tips for how to ask a moderately intelligent question. — tripleee, Sep 07 '15 at 12:15
No, "Mac" does not silently change any encodings. Maybe something in your particular workflow is, but since we have no idea what that is, we cannot help you with it. — deceze, Sep 07 '15 at 12:55
simply printing on the terminal is working perfectly fine. Just redirecting it to the file is creating issues. — neel, Sep 07 '15 at 12:57
So your only claim to "failure" is the output of `file`? Maybe `file` simply *cannot guess the encoding*, but other than that everything's fine...!? — deceze, Sep 07 '15 at 13:28
Yeah, but even when I tell the editor or vi to open it as UTF-8, it is unable to do so. — neel, Sep 08 '15 at 05:21
Then what can be the issue? How is it working when simply printing to the terminal? — neel, Sep 08 '15 at 06:22
You continue to fail to supply the information we have requested. A file which is not UTF-8 will print fine in the terminal if your terminal is set up to use the same encoding as the file. But we cannot guess what it is; all you are telling us is that it is definitely not UTF-8. — tripleee, Sep 08 '15 at 06:48
What more information should I supply? Can you try to elaborate? — neel, Sep 08 '15 at 06:50
The very first comment up there has some concrete advice. Briefly, a hex dump of a few bytes should already do wonders, especially if those bytes are something else than plain 7-bit ASCII, and especially especially if you can tell us what they should display as. — tripleee, Sep 08 '15 at 06:51
I am getting "â<80><9c>The Galaxy A5 and A3 offer a beautifully crafted full metal unibody" instead of "“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody" — neel, Sep 08 '15 at 06:52
I updated your question with hopefully relevant and correct information from comments. Please review, especially where I had to guess. — tripleee, Sep 08 '15 at 07:13

score 1 · Answer 1 · answered Sep 08 '15 at 07:16

Based on the information in the question, the file is a perfectly good UTF-8 file. The first three bytes encode LEFT DOUBLE QUOTATION MARK (U+201C) aka a curly quote.
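
As a quick check of that claim, here is a minimal Python sketch (not part of the original answer) that decodes the three bytes from the hex dump. It also decodes them as Latin-1, which is an assumption about what a misconfigured viewer might do, and which would be consistent with the `â<80><9c>` reported in the comments.

```python
import unicodedata

# The first three bytes reported by the hex dump in the question.
raw = bytes([0xE2, 0x80, 0x9C])

# Decoded as UTF-8, the three bytes form a single character.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8), hex(ord(as_utf8)), unicodedata.name(as_utf8))

# Decoded as Latin-1 (an assumed wrong guess by a viewer), each byte
# becomes its own character: U+00E2 followed by two C1 control characters.
as_latin1 = raw.decode("latin-1")
print([hex(ord(ch)) for ch in as_latin1])
```

If the first decode succeeds for the whole file, the data itself is valid UTF-8 and the problem lies with whatever tool is guessing the encoding.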

Maybe your version of `file` is really old.

tripleee

If you are still having trouble, maybe post a new question with adequate diagnostics. You can bisect the file to find the problematic bytes (delete half, see if the problem persists; if not, restore the problematic half, and proceed iteratively with deleting half of *that*, etc.) – tripleee Sep 08 '15 at 08:11
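
As an alternative to bisecting by hand, a decoder can report the first offending offset directly. The sketch below is a different technique from the manual bisection described above, and the helper name `first_invalid_utf8_offset` is hypothetical. It relies on the `start` attribute that Python sets on `UnicodeDecodeError`.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected.
        return err.start
    return None


# Valid UTF-8: the curly quote is encoded as E2 80 9C.
good = "\u201cThe Galaxy A5".encode("utf-8")
# Invalid UTF-8: a lone 0x93 byte (a curly quote in Windows-1252).
bad = b"The Galaxy \x93A5"

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

For a real file, read it in binary mode (for example `open("filename", "rb").read()`) and pass the bytes to the helper; a hex dump around the returned offset is then a good diagnostic to include in a new question.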

Joe Zitzelberger · Answer 2 · 2015-09-07T12:50:56.067

0

You can use `iconv` to convert the file to the desired charset, for example:

iconv --from-code=UTF-8 --to-code=YOURTARGET filename

This writes the converted text to standard output. To get a list of supported targets, use the `--list` flag.
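
For comparison, the same kind of conversion can be sketched in Python. This is only an illustration, not what `iconv` does internally: the helper name `transcode` is hypothetical, and Windows-1252 (`cp1252`) is an arbitrary target chosen because it has a slot for the curly quote from the question.

```python
def transcode(data: bytes, source: str = "utf-8", target: str = "cp1252") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


# The start of the file from the question, as UTF-8 bytes.
sample = "\u201cThe Galaxy A5".encode("utf-8")

converted = transcode(sample)
# The three-byte UTF-8 quote becomes a single byte in Windows-1252.
print(len(sample), len(converted), converted[:1])
```

Not every target can represent every character: asking for `latin-1` here raises `UnicodeEncodeError`, because U+201C has no Latin-1 equivalent.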

edited Sep 07 '15 at 12:50

answered Sep 07 '15 at 12:13

I tried it, it is showing iconv: conversion from unknown-8bit unsupported – neel Sep 07 '15 at 12:15
Do you know the charset from the remote server? You should be able to specify it to do the conversion. – Joe Zitzelberger Sep 07 '15 at 12:31
Even if it is unknown, you should be able to specify the --from=UTF8 flag to override the assumptions and force proper conversion. – Joe Zitzelberger Sep 23 '15 at 05:15
