I'm wrangling some data to ingest into Hive. The problem is that my historical data contains overwrites, so I need to include the file name in each row of the text files; that way I can discard duplicated rows that were updated in subsequent files.

My plan is to use awk to add the file name to every row of each file, and then, after ingesting into Hive, use HQL to filter out the deprecated rows.

Here is my sample data (tab-delimited):

animal  legs    eyes
hippo   4       2
spider  8       8
crab    8       2
mite    6       0
bird    2       2

I've named it long_name_20180901.txt

I've figured out how to add my new column from this post:

awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt

which results in:

long_name_20180901.txt  animal  legs    eyes
long_name_20180901.txt  hippo   4       2
long_name_20180901.txt  spider  8       8
long_name_20180901.txt  crab    8       2
long_name_20180901.txt  mite    6       0
long_name_20180901.txt  bird    2       2

But, being a beginner, I don't know how to augment this command to:

  1. make the column name (first line) something like "file_name"
  2. use a regex in awk to extract just the part of the file name that I need and discard the rest. I really just want the capturing group in "long_name_(.{8,})\.txt".

Target output is:

file  animal  legs    eyes
20180901  hippo   4       2
20180901  spider  8       8
20180901  crab    8       2
20180901  mite    6       0
20180901  bird    2       2

Thanks for your time!! I'm a total newbie to awk.

Zafar
  • What about simply using an HQL function instead? Cf. https://stackoverflow.com/a/16719530/5162372 – Samson Scharfrichter Feb 26 '19 at 20:24
  • Thanks for the creative thinking here. I'll probably go this route, since all my files are compressed and the decompress/recompress time needed to run awk won't make sense. – Zafar Feb 26 '19 at 20:53

2 Answers

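You can set a prefix to "file" in a BEGIN block so that it is printed in front of the header line, and then reset the prefix to the date portion of the file name for the remaining rows. Note that this handles a single input file.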

awk 'BEGIN{f="file\t"} NF{print f $0; if (f=="file\t") {l=split(FILENAME, a, /[_.]/); f=a[l-1]"\t"};}' long_name_20180901.txt
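To see why a[l-1] ends up holding the date, here is a small sketch of what split() does with the sample file name when the separator is the regex /[_.]/. The show_parts wrapper and the name variable are made up for this illustration and are not part of the command above.

```shell
# Hypothetical helper: show how split() breaks a file name apart on "_" and "."
show_parts() {
    awk -v name="$1" 'BEGIN {
        # split() returns the number of pieces and fills parts[1..n]
        n = split(name, parts, /[_.]/)
        for (i = 1; i <= n; i++) print i, parts[i]
        # The last piece is the extension, so the date is the one before it
        print "date:", parts[n-1]
    }'
}

show_parts long_name_20180901.txt
```

For the sample name this lists the pieces long, name, 20180901 and txt, so the second-to-last piece is the date. The approach relies on the date being the last underscore-separated part before the extension.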
P.P

This handles one or more input files and prints the header only once, assuming every file starts with the same header line:

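awk -v OFS='\t' '
    # Print the header once, taken from the first line of the first file
    NR==1 { print "file", $0 }
    # On the first line of each file, pull the date out of FILENAME, then skip that line
    FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }
    # Prefix every data row with the date
    { print fname, $0 }
' *.txt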
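Since the question asks about a regex, here is an alternative sketch that uses match() with substr() instead of split(). POSIX awk has no capturing groups, so this assumes the part to keep is the run of digits immediately before ".txt"; the add_date_column wrapper, the fdate variable and the temporary demo file are all made up for the example.

```shell
# Hypothetical helper: prefix each row with the digits that precede ".txt"
add_date_column() {
    awk -v OFS='\t' '
        FNR == 1 {
            # Find "<digits>.txt" at the end of the name; RSTART/RLENGTH mark the match
            if (match(FILENAME, /[0-9]+\.txt$/))
                fdate = substr(FILENAME, RSTART, RLENGTH - 4)  # drop ".txt"
            else
                fdate = FILENAME                               # fall back to the full name
            if (NR == 1) print "file", $0                      # header only once
            next
        }
        { print fdate, $0 }
    ' "$@"
}

# Demo with a throwaway copy of part of the sample data
tmp=$(mktemp -d)
printf 'animal\tlegs\teyes\nhippo\t4\t2\nspider\t8\t8\n' > "$tmp/long_name_20180901.txt"
add_date_column "$tmp/long_name_20180901.txt"
rm -r "$tmp"
```

Because the pattern is anchored to the end of the name, digits elsewhere in the path do not affect the result. Like the command above, it assumes every input file begins with a header line.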
Ed Morton