I have a script that reads data from a server and stores it in a file, but the file seems to be corrupt somehow. I can print it to the terminal, but checking it with `file` produces:

bash$ file -I filename  
filename: text/plain; charset=unknown-8bit

Why is it telling me that the encoding is unknown? The first line of the file displays for me as

“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody

A hex dump reveals that the first three bytes are 0xE2, 0x80, 0x9C followed by the regular ASCII text The Galaxy A5...

What's wrong? Why does file tell me the encoding is unknown, and what is it actually?

tripleee
neel
  • Without access to the file *or* some sort of indication what's in it and how it's broken; no, we cannot. Maybe check out the [`character-encoding` tag wiki](http://stackoverflow.com/tags/character-encoding/info) for some tips for how to ask a moderately intelligent question. – tripleee Sep 07 '15 at 12:15
  • No, "Mac" does not silently change any encodings. Maybe something in your particular workflow is, but since we have no idea what that is, we cannot help you with it. – deceze Sep 07 '15 at 12:55
  • Simply printing to the terminal works perfectly fine. Only redirecting it to the file creates issues. – neel Sep 07 '15 at 12:57
  • So your only claim to "failure" is the output of `file`? Maybe `file` simply *cannot guess the encoding*, but other than that everything's fine...!? – deceze Sep 07 '15 at 13:28
  • Yeah, but even when I tell the editor or vi to open it as UTF-8, it is unable to do so. – neel Sep 08 '15 at 05:21
  • Then maybe it's not UTF-8. Why do you think it should be? – deceze Sep 08 '15 at 06:16
  • Then what could the issue be? How does it work when simply printing to the terminal? – neel Sep 08 '15 at 06:22
  • I checked it on Linux; it's working fine there. – neel Sep 08 '15 at 06:32
  • You continue to fail to supply the information we have requested. A file which is not UTF-8 will print fine in the terminal if your terminal is set up to use the same encoding as the file. But we cannot guess what it is; all you are telling us is that it is definitely not UTF-8. – tripleee Sep 08 '15 at 06:48
  • What more information should I supply? Can you try to elaborate? – neel Sep 08 '15 at 06:50
  • The very first comment up there has some concrete advice. Briefly, a hex dump of a few bytes should already do wonders, especially if those bytes are something other than plain 7-bit ASCII, and especially especially if you can tell us what they should display as. – tripleee Sep 08 '15 at 06:51
  • I am getting this "â<80><9c>The Galaxy A5 and A3 offer a beautifully crafted full metal unibody" instead of "“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody" – neel Sep 08 '15 at 06:52
  • I updated your question with hopefully relevant and correct information from comments. Please review, especially where I had to guess. – tripleee Sep 08 '15 at 07:13
  • That's fine, thanks for editing. – neel Sep 08 '15 at 07:16

2 Answers

Based on the information in the question, the file is a perfectly good UTF-8 file. The first three bytes encode LEFT DOUBLE QUOTATION MARK (U+201C) aka a curly quote.
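
As a quick check of that claim, here is a minimal Python sketch that decodes the bytes from the hex dump. The variable names are illustrative only, and the sample is just the first few bytes quoted in the question, not the whole file.

```python
import unicodedata

# First bytes of the file as reported by the hex dump in the question.
raw = bytes([0xE2, 0x80, 0x9C]) + b"The Galaxy A5"

# Decoding as UTF-8 succeeds; the three lead bytes form one character.
text = raw.decode("utf-8")
first = text[0]
print(f"U+{ord(first):04X}", unicodedata.name(first))

# Decoding the same bytes as Latin-1 maps each byte to one character:
# U+00E2 (a with circumflex) followed by two C1 control characters.
mojibake = raw.decode("latin-1")
print([f"U+{ord(c):04X}" for c in mojibake[:3]])
```

The second decode is consistent with the "â<80><9c>" rendering quoted in the comments, which is what a viewer that is not interpreting the bytes as UTF-8 would be expected to show.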

Maybe your version of `file` is really old.

tripleee
  • If you are still having trouble, maybe post a new question with adequate diagnostics. You can bisect the file to find the problematic bytes (delete half, see if the problem persists; if not, restore the problematic half, and proceed iteratively with deleting half of *that*, etc.) – tripleee Sep 08 '15 at 08:11
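
As an alternative to bisecting by hand, a UTF-8 decoder can report the offset of the first byte it rejects. The sketch below is one way to do that in Python; `first_invalid_utf8` is a hypothetical helper name, and the "bad" sample with Windows-1252 style quote bytes is made up for illustration, not taken from the asker's file.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks
    UTF-8 decoding, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Valid sample: the opening bytes described in the question.
good = "\u201cThe Galaxy A5".encode("utf-8")

# Invalid sample (assumption): the same text followed by single-byte
# curly quotes as a legacy 8-bit encoding might produce them.
bad = good + b"\x93quoted\x94"

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

To apply this to a real file, read it in binary mode (for example `open("filename", "rb").read()`) and pass the bytes to the helper; the reported offset can then be inspected in a hex dump.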

You can use iconv to convert the file to the desired charset, e.g.:

iconv --from-code=UTF-8 --to-code=YOURTARGET filename > filename.converted

To get a list of supported targets, use the --list flag.
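
For comparison, the same kind of conversion can be sketched in Python as a decode followed by an encode. The helper name `transcode` and the `cp1252` target are assumptions chosen for illustration; substitute whatever target charset is actually needed.

```python
def transcode(data: bytes, source: str = "utf-8", target: str = "cp1252") -> bytes:
    """Decode bytes from the source charset and re-encode them in the target.

    Raises UnicodeDecodeError or UnicodeEncodeError if the data does not
    fit the source or target charset.
    """
    return data.decode(source).encode(target)


# The opening text from the question, encoded as UTF-8.
utf8_bytes = "\u201cThe Galaxy A5".encode("utf-8")
converted = transcode(utf8_bytes)

# The three-byte UTF-8 sequence becomes a single byte in cp1252.
print(utf8_bytes[:3].hex(), "->", converted[:1].hex())
```

Like iconv, this fails loudly when a character has no representation in the target charset, which is usually preferable to silently dropping it.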

Joe Zitzelberger