
I'm using WebHDFS to ingest data from the local file system into HDFS. Now I want to ensure the integrity of the files ingested into HDFS.

How can I make sure the transferred files are not corrupted, altered, etc.?

I used the WebHDFS command below to get the checksum of a file:

curl -i -L --negotiate -u: -X GET "http://$hostname:$port/webhdfs/v1/user/path?op=GETFILECHECKSUM"

How should I use the above checksum to ensure the integrity of the ingested files? Please suggest.

Below are the steps I'm following:

$ md5sum locale_file
740c461879b484f4f5960aa4f67a145b

$ hadoop fs -checksum locale_file
locale_file     MD5-of-0MD5-of-512CRC32C        000002000000000000000000f4ec0c298cd6196ffdd8148ae536c9fe

The checksum of the file on the local system is different from that of the same file on HDFS. I need to compare the checksums; how can I do that?

Chhaya Vishwakarma

3 Answers


One way to do that is to calculate the checksum locally and then match it against the Hadoop checksum after you ingest the file.

I wrote a library that calculates the checksum locally for this, in case anybody is interested: https://github.com/srch07/HDFSChecksumForLocalfile
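Once you have the locally computed checksum and the one reported by `hadoop fs -checksum`, the match itself is just a string comparison. A minimal sketch (the `same_checksum` helper is a hypothetical name, not part of the library) that normalizes case and whitespace before comparing:

```shell
# same_checksum HEX1 HEX2 — returns 0 if the two checksum strings match,
# ignoring case and surrounding whitespace.
same_checksum() {
  local x y
  x=$(printf '%s' "$1" | tr 'A-F' 'a-f' | tr -d '[:space:]')
  y=$(printf '%s' "$2" | tr 'A-F' 'a-f' | tr -d '[:space:]')
  [ "$x" = "$y" ]
}
```

Usage: `same_checksum "$local_sum" "$hdfs_sum" && echo match || echo MISMATCH`. Note this only makes sense when both sides use the same algorithm; `hadoop fs -checksum` reports a composite `MD5-of-...CRC32C` value, which is why a library that reproduces that algorithm locally is needed in the first place.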

Abhishek Anand

Try this:

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"

Refer to the following link for full information:

https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Get_File_Checksum
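The `GETFILECHECKSUM` call returns a JSON document, so you still need to pull the hex digest out before comparing. A sketch of that step, using a canned response in the shape the WebHDFS docs describe (the real call is shown in the comment; host and path are placeholders):

```shell
# In practice you would fetch the response with something like:
#   RESPONSE=$(curl -s -L --negotiate -u: "http://$hostname:$port/webhdfs/v1/user/path?op=GETFILECHECKSUM")
# Canned example response for illustration:
RESPONSE='{"FileChecksum":{"algorithm":"MD5-of-0MD5-of-512CRC32C","bytes":"000002000000000000000000f4ec0c298cd6196ffdd8148ae536c9fe","length":28}}'

# Extract the hex checksum bytes with python3 (jq works too, if installed)
BYTES=$(printf '%s' "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["FileChecksum"]["bytes"])')
echo "$BYTES"
```

The extracted value can then be compared against a checksum computed locally with the same composite algorithm.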

BruceWayne

It can be done from the console as below:

$ md5sum locale_file
740c461879b484f4f5960aa4f67a145b

$ hadoop fs -cat locale_file | md5sum -
740c461879b484f4f5960aa4f67a145b -
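The two console commands above can be wrapped in a small script that automates the comparison. A sketch, assuming a configured Hadoop client; `verify_transfer` is a hypothetical helper that takes the local file and a command that streams the remote copy:

```shell
# verify_transfer LOCAL_FILE STREAM_CMD
# Compares the MD5 of a local file against the MD5 of whatever STREAM_CMD
# writes to stdout; prints OK on a match, MISMATCH otherwise.
verify_transfer() {
  local a b
  a=$(md5sum "$1" | awk '{print $1}')
  b=$(eval "$2" | md5sum | awk '{print $1}')
  if [ "$a" = "$b" ]; then echo OK; else echo MISMATCH; fi
}

# In practice (hypothetical HDFS path):
#   verify_transfer locale_file "hadoop fs -cat /user/path/locale_file"
```

Note that `hadoop fs -cat` reads the whole file back over the network, so this is simple but expensive for large files; the composite-checksum approaches in the other answers avoid the full re-read.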

You can also verify the local file via code:

import java.io._
import org.apache.commons.codec.digest.DigestUtils

// hash the file contents, not the literal string "locale_file"
val md5sum = DigestUtils.md5Hex(new FileInputStream("locale_file"))

and for the Hadoop side:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

// stream the HDFS copy and compute its MD5
// (in a Spark shell you can use sc.hadoopConfiguration instead of new Configuration())
val conf = new Configuration()
val md5sum = MD5Hash.digest(FileSystem.get(conf).open(new Path("locale_file"))).toString
maxmithun