
I've got a bunch of 100 GB files on HDFS with mixed file encodings (unfortunately stored in Azure Blob Storage). How can I determine the encoding of each file? Some dfs command-line command would be ideal. Thanks.

conner.xyz
  • Did you set the "Content-Encoding" when uploading the files? If yes, you can get it from the properties of the blobs. If not, you can download part of a blob as binary and use an encoding detection program to guess the encoding. Here is a Python package for detecting encoding: [chardet](https://pypi.python.org/pypi/chardet) – Jack Zeng Mar 24 '16 at 01:36
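
To make the comment's suggestion concrete, here is a minimal sketch using chardet, assuming a small prefix of the blob has already been downloaded to a local file (the file name `sample` is just a placeholder):

```python
# Minimal sketch: detect the encoding of a locally downloaded prefix of a blob.
# Assumes the sample bytes are already in a local file called "sample".
import chardet

with open("sample", "rb") as f:
    raw = f.read(64 * 1024)  # a small prefix is usually enough for detection

result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(result["encoding"], result["confidence"])
```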

2 Answers


I ended up getting the results I needed by piping the beginning of each file in blob storage to a local buffer and then applying the Unix `file` utility. Here's what the command looks like for an individual file:

hdfs dfs -cat wasb://container@account.blob.core.windows.net/path/to/file | head -n 10 > buffer; file -i buffer

This gets you something like:

buffer: text/plain; charset=us-ascii
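
If you need this for every file under a directory rather than one file at a time, here is a rough Python sketch that wraps the same idea in a loop over `hdfs dfs -ls` output. The wasb:// path, the 10-line sample size, and the use of `file -i -` to read from stdin instead of a temporary buffer are all assumptions to adapt:

```python
# Rough sketch: run the cat | head | file pipeline for every file under a directory.
import subprocess

BASE = "wasb://container@account.blob.core.windows.net/path/to"

# Parse file paths out of `hdfs dfs -ls` output (regular files start with "-",
# and the path is the last column).
ls_out = subprocess.run(["hdfs", "dfs", "-ls", BASE],
                        capture_output=True, text=True, check=True).stdout
paths = [line.split()[-1] for line in ls_out.splitlines() if line.startswith("-")]

for path in paths:
    # Grab the first few lines of the file and ask `file` for the charset.
    pipeline = f"hdfs dfs -cat '{path}' | head -n 10 | file -i -"
    charset = subprocess.run(pipeline, shell=True,
                             capture_output=True, text=True).stdout.strip()
    print(path, "->", charset)
```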
conner.xyz

You can try the Azure cross-platform CLI: https://azure.microsoft.com/en-us/documentation/articles/xplat-cli-install/

The commands `azure storage blob list` and `azure storage blob show` will return all the available blob properties, including contentType, contentLength, and metadata.

If this information doesn't contain what you want (the file encoding), I think you need to define and set your own metadata, such as a file-encoding key, for each file. Then you can retrieve it via the CLI tool.
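
The answer above uses the cross-platform CLI; purely as an illustration of the same custom-metadata idea, here is a rough Python sketch with the azure-storage-blob SDK (the v12-style API, the connection-string environment variable, and the container/blob names are all assumptions, not part of the original answer):

```python
# Illustration of the custom-metadata idea using the azure-storage-blob SDK
# (v12-style API; connection string, container, and blob names are placeholders).
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
blob = service.get_blob_client(container="container", blob="path/to/file")

# Tag the blob with its encoding once you have determined it.
blob.set_blob_metadata(metadata={"file_encoding": "utf-8"})

# Later, read the properties back: content type, size, and the custom metadata.
props = blob.get_blob_properties()
print(props.content_settings.content_type, props.size, props.metadata)
```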

Haibo Song