
I have a file in HDFS on which I need to run an awk script, and I would then save the result to another HDFS location. One way would be to download the HDFS file locally and run the awk manipulations there. Another way is to pipe the output of `hdfs dfs -cat` on the file into awk.
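For example, the pipe approach I have in mind is roughly the following (the paths and the awk program are just placeholders):

```sh
# Stream the HDFS file through awk on the local/edge node, then write the result back to HDFS.
# Paths and the awk program are placeholders for illustration.
hdfs dfs -cat /user/you/input/data.txt \
  | awk -F '\t' '{ print $1 "\t" $3 }' \
  | hdfs dfs -put - /user/you/output/data_processed.txt
```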

Is there a way to delegate this work to the MapReduce framework instead, since these files are very large and contain millions of records?

I found this article on using Hadoop Streaming, but I am not able to find the streaming jar: https://dzone.com/articles/using-awk-and-friends-hadoop

raizsh

1 Answer


Sure, you can use MapReduce (or, ideally, Spark) to read the file and process it however you need.

hadoop-streaming would let you run awk as the mapper, but I doubt there are many in-depth examples of using it compared with "actual" MapReduce code.
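As a rough, untested sketch (the jar location and HDFS paths below are assumptions that vary by install, and transform.awk is a placeholder for your script), a map-only streaming job could look like this. On most Hadoop 2.x+ installs the streaming jar sits under $HADOOP_HOME/share/hadoop/tools/lib/.

```sh
# Hypothetical sketch: run an awk script as the mapper of a map-only streaming job.
# transform.awk holds whatever manipulation you would normally pipe the file through, e.g.:
#   { print $1 "\t" $3 }
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files transform.awk \
  -input /user/you/input/data.txt \
  -output /user/you/output/awk_result \
  -mapper "awk -f transform.awk" \
  -numReduceTasks 0
```

With zero reducers, awk's output is written straight to part-* files under the output directory, which you can then read from HDFS or pull down with something like `hdfs dfs -getmerge`.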

OneCricketeer