
I have a file in HDFS on which I need to run an awk script, and I would then save the result to another HDFS location. One way would be to download the HDFS file locally and run the awk manipulations there. Another way is to pipe the output of `hdfs dfs -cat` on the file into awk.
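For example, the pipe approach I have in mind is roughly the following (the paths and the awk program are just placeholders):

```sh
# Stream the HDFS file through awk on the local/edge node, then write the result back to HDFS.
# Paths and the awk program are placeholders for illustration.
hdfs dfs -cat /user/you/input/data.txt \
  | awk -F '\t' '{ print $1 "\t" $3 }' \
  | hdfs dfs -put - /user/you/output/data_processed.txt
```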

Is there a way to delegate this work to the MapReduce framework instead, since these files are very large and contain millions of records?

I found this article on using Hadoop Streaming, but I am not able to find the streaming jar: https://dzone.com/articles/using-awk-and-friends-hadoop

raizsh

1 Answer


Sure, you can use MapReduce (or, ideally, Spark) to read the file and process it however you need.

hadoop-streaming would let you run awk as the mapper, but I doubt there are many in-depth examples of using it compared with "actual" MapReduce code.
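As a rough, untested sketch (the jar location and HDFS paths below are assumptions that vary by install, and transform.awk is a placeholder for your script), a map-only streaming job could look like this. On most Hadoop 2.x+ installs the streaming jar sits under $HADOOP_HOME/share/hadoop/tools/lib/.

```sh
# Hypothetical sketch: run an awk script as the mapper of a map-only streaming job.
# transform.awk holds whatever manipulation you would normally pipe the file through, e.g.:
#   { print $1 "\t" $3 }
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files transform.awk \
  -input /user/you/input/data.txt \
  -output /user/you/output/awk_result \
  -mapper "awk -f transform.awk" \
  -numReduceTasks 0
```

With zero reducers, awk's output is written straight to part-* files under the output directory, which you can then read from HDFS or pull down with something like `hdfs dfs -getmerge`.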

OneCricketeer