I'm working on wrangling some data to ingest into Hive. The problem is, I have overwrites in my historical data so I need to include the file name in the text files so that I can dispose of the duplicated rows which have been updated in subsequent files.
The way I've chosen to go about this is to use awk
to add the file name to each file, then after I ingest into Hive I can use HQL to filter out my deprecated rows.
Here is my sample data (tab-delimited):
animal legs eyes
hippo 4 2
spider 8 8
crab 8 2
mite 6 0
bird 2 2
I've named it long_name_20180901.txt
I've figured out how to add my new column from this post:
awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt
which results in:
long_name_20180901.txt animal legs eyes
long_name_20180901.txt hippo 4 2
long_name_20180901.txt spider 8 8
long_name_20180901.txt crab 8 2
long_name_20180901.txt mite 6 0
long_name_20180901.txt bird 2 2
But, being a beginner, I don't know how to augment this command to:
- make the column name (first line) something like "file_name"
- implement regex in awk to just extract the part of the file name that I need, and dispose of the rest. I really just want
"long_name_(.{8,}).txt"
(the stuff in the capturing group.
Target output is:
file animal legs eyes
20180901 spider 8 8
20180901 crab 8 2
20180901 mite 6 0
20180901 bird 2 2
Thanks for your time!! I'm a total newbie to awk
.