I have a file in HDFS on which I need to run an awk script, and then save the result to another HDFS location. One option is to download the HDFS file to the local filesystem and run the awk manipulations there. Another is to pipe the output of `hdfs dfs -cat` on the file straight into awk.
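To make the second option concrete, this is roughly what I am doing today (the input/output paths and the awk script name are placeholders for my actual ones); everything runs on a single client machine:

```sh
# Stream the HDFS file through awk on the client, then write the result back to HDFS.
hdfs dfs -cat /input/data.txt \
  | awk -f transform.awk \
  | hdfs dfs -put - /output/result.txt
```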
Is there a way to delegate this work to the MapReduce framework instead? These files are very large, with millions of records, so processing them on a single machine is slow.
I found this article on using Hadoop Streaming with awk, but I am not able to find the streaming jar on my cluster: https://dzone.com/articles/using-awk-and-friends-hadoop
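Based on that article, I believe the invocation should look roughly like the sketch below. The jar path is only a guess (finding the actual jar is exactly my problem), and `transform.awk` is a placeholder for my script:

```sh
# Rough idea of the Hadoop Streaming job I expect to run, assuming the streaming
# jar ships under the Hadoop installation (path below is a guess, not confirmed):
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input  /input/data.txt \
  -output /output/result \
  -mapper "awk -f transform.awk" \
  -file   transform.awk          # ship the awk script to the task nodes
```

Is this the right approach, and where would the streaming jar normally live?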