
I started using the Hadoop fsimage to validate that our ETL process loads data correctly (with the proper file size). I parse the image and make it available via Impala. I noticed that for every file my query flags as incorrectly loaded (wrong file size), the file size shown in the fsimage is 2147483647.
However, if I look at HDFS with hadoop fs -du, I get a different (and correct) file size. Any idea why the fsimage would show this number? If I pull a new image and search again, the value is still wrong, no matter how many days back I look.
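
To illustrate, the check is essentially a query like the one below (a minimal sketch only; the table and column names are placeholders, not my actual schema):

-- Sketch of the validation query; hive_warehouse_files, etl_expected_sizes,
-- path and filesize are placeholder names, not my real schema.
SELECT f.path, f.filesize AS fsimage_size, e.expected_size
FROM hive_warehouse_files f
JOIN etl_expected_sizes e ON e.path = f.path
WHERE f.filesize <> e.expected_size;
-- every row this returns shows fsimage_size = 2147483647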

EDIT: The code for extracting the image was not developed by me:

sudo ssh hostname 'hdfs oiv -p Delimited -i $(ls -t /dfs/nn/current/fsimage_* | grep -v md5 | head -1) -o /dev/stdout 2>/dev/null' \
  | grep -v "/.Trash/" | sed -e 's/\r/\\r/g' \
  | awk 'BEGIN { FS="\t"; OFS="\t" }
    $0 !~ /_impala_insert_staging/ && ($0 ~ /^\/user\/hive\/warehouse\/cz_prd/ || $0 ~ /^\/user\/hive\/warehouse\/cz_tst/) { split($1,a,"/"); db=a[5]; table=a[6]; gsub(".db$", "", table); }
    db && $10 ~ /^d/  { par=""; for(i=7;i<=length(a);i++) par=par"/"a[i] }
    db && $10 !~ /^d/ { par=""; for(i=7;i<=length(a)-1;i++) par=par"/"a[i]; file=a[length(a)] }
    NR > 1 { print db, table, par, file, $0 }' \
  | hadoop fs -put -f - /user/hive/warehouse/cz_prd_mon_ma.db/hive_warehouse_files/fsimage.tsv

1 Answer


Stupid as I am, I had the file size column in my SQL table definition typed as INT.
When I displayed the file with the hadoop fs -cat command, the value looked fine, so I changed the column to BIGINT and now the size is displayed correctly. (2147483647 is the maximum value of a 32-bit signed integer, which is why every larger file size was stuck at that number.)
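
For anyone hitting the same thing, the fix is just the column type, along these lines (table and column names are placeholders for mine):

-- Sketch of the fix; hive_warehouse_files / filesize are placeholder names.
-- File sizes above the 32-bit signed INT range need a BIGINT column.
ALTER TABLE hive_warehouse_files CHANGE filesize filesize BIGINT;
-- after this, the query returns the real sizes, matching hadoop fs -du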
