I've got a bunch of 100GB files on hdfs with mixed file-encodings (unfortunately in Azure blob storage). How can I determine the file encodings of each file? Some dfs command-line command would be ideal. Thanks.
Did you set the "Content-Encoding" when uploading the files? If yes, you can get it from the properties of the blobs. If not, you can read part of a blob as binary and use an encoding-detection program to guess its encoding. Here is a Python package for detecting encodings: [chardet](https://pypi.python.org/pypi/chardet) – Jack Zeng Mar 24 '16 at 01:36
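As a minimal sketch of that approach from the command line, assuming the chardet package is installed (it ships a chardetect console script) and that the wasb path below is a placeholder:

# Pull the first 64 KB of the blob down as a binary sample (path is a placeholder)
hdfs dfs -cat wasb://container@account.blob.core.windows.net/path/to/file | head -c 65536 > sample
# Let chardet guess the encoding of the sample
chardetect sample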
2 Answers
I ended up getting the results I needed by piping the beginning of each file in blob storage into a local buffer and then applying the Unix file utility. Here's what the command looks like for an individual file:
hdfs dfs -cat wasb://container@account.blob.core.windows.net/path/to/file | head -n 10 > buffer; file -i buffer
This gets you something like:
buffer: text/plain; charset=us-ascii
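To cover every file rather than a single one, something like the following rough sketch wraps the same pipeline in a loop. The directory path is a placeholder, the grep assumes the paths contain no spaces, and the 64 KB sample size is an arbitrary choice:

# List the files under the directory, sample the start of each, and report its detected encoding
for f in $(hdfs dfs -ls wasb://container@account.blob.core.windows.net/path/to/dir | grep -o 'wasb://.*'); do
  # Grab a fixed-size binary sample from the start of the file
  hdfs dfs -cat "$f" | head -c 65536 > buffer
  # -b drops the filename prefix, -i prints the MIME type and charset
  echo "$f: $(file -bi buffer)"
done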

conner.xyz
You can try the Azure CLI: https://azure.microsoft.com/en-us/documentation/articles/xplat-cli-install/
The commands azure storage blob list and azure storage blob show
will return all the available blob properties, including contentType, contentLength, and metadata.
If this information doesn't contain what you want (the file encoding), I think you need to define and set your own metadata, such as a file-encoding key,
for each file. Then you can retrieve it back via the CLI tool.
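As a rough sketch of the read side, assuming the classic (xplat) azure CLI is installed and the storage credentials are exported as AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_ACCESS_KEY (the container and blob names below are placeholders):

# List the blobs in the container, then inspect one blob's properties;
# contentType and any custom metadata (e.g. a file-encoding key set at
# upload time) appear in the show output.
azure storage blob list mycontainer
azure storage blob show mycontainer path/to/file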

Haibo Song