
This is my hadoop job:

# the -mapper line below is the one that doesn't work
hadoop streaming \
-D mapred.map.tasks=1 \
-D mapred.reduce.tasks=1 \
-mapper "awk '{if(\$0<3)print}'" \
-reducer "cat" \
-input "/user/***/input/" \
-output "/user/***/out/"

This job always fails with the following error:

sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `export TMPDIR='..../work/tmp'; /bin/awk { if ($0 < 3) print } '

But if I change the -mapper to -mapper "awk '{print}'", it works without any error. What's the problem with the if(..)?

UPDATE:

Thanks, @paxdiablo, for your detailed answer.

What I really want to do is filter out rows whose 1st column is greater than x before piping the input data to my custom bin. So the -mapper actually looks like this:

-mapper "awk -v x=$x{if($0<x)print} | ./bin" 

Is there any other way to achieve that?

Alcott

1 Answer


The problem isn't the if per se; it's that the quotes have been stripped from your awk command by the time it reaches the shell.

You'll realise this when you look at the error output:

sh: -c: line 0: `export TMPDIR='..../work/tmp'; /bin/awk { if ($0 < 3) print } '

and when you try to execute that quote-stripped command directly:

pax> echo hello | awk {if($0<3)print}
bash: syntax error near unexpected token `('

pax> echo hello | awk {print}
hello

The {print} version works because it doesn't contain the shell-special ( character.
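The failure can be reproduced locally without Hadoop by handing both forms of the command to sh -c. This is only a sketch: it assumes, based on the error message above, that streaming ends up running something like sh -c on the quote-stripped string, and the variable names are purely illustrative.

```shell
#!/bin/sh
# Quote-stripped form: the inner shell sees a bare ( and reports a syntax error.
if printf '1\n2\n5\n' | sh -c 'awk {if($0<3)print}' 2>/dev/null; then
  stripped_status=ok
else
  stripped_status=failed
fi

# Quoted form: the inner shell receives awk '{if($0<3)print}' with the quotes intact.
quoted_out=$(printf '1\n2\n5\n' | sh -c "awk '{if(\$0<3)print}'")

echo "stripped: $stripped_status"
echo "quoted output:"
echo "$quoted_out"
```

The first command fails before awk even starts, while the second filters the sample input as intended.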

One thing you might want to try is to escape the special characters to ensure the shell doesn't try to interpret them:

{if\(\$0\<3\)print}

It may take some effort to get the string escaped correctly, but you can look at the error output to see what is actually generated. I've escaped the ( and ) since the shell uses them to create subshells, the $ to prevent variable expansion, and the < to prevent input redirection.
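One way to check an escaped string before submitting a job is to pass it through a single extra shell locally. A minimal sketch, assuming the mapper string goes through exactly one sh -c (Hadoop's own handling may add or remove a layer, so the error output remains the final authority); mapper_cmd is a hypothetical variable name:

```shell
#!/bin/sh
# The single quotes protect the string from *this* shell only;
# the inner sh then sees the backslash-escaped characters.
mapper_cmd='awk {if\(\$0\<3\)print}'

# The inner shell removes the backslashes, so awk receives {if($0<3)print}.
escaped_out=$(printf '1\n2\n5\n' | sh -c "$mapper_cmd")
echo "$escaped_out"
```

If the escaping is right for one shell layer, only the lines below 3 come back.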


Also keep in mind that there may be other ways to filter depending on your needs, ways that avoid shell-special characters altogether. If you specify what your needs are, we can possibly help further.

For example, you could create a shell script (e.g., pax.sh) to do the actual awk work for you:

#!/bin/bash
awk -v x="$1" '{if($1<x)print}'

then use that shell script in the mapper without any special shell characters:

hadoop streaming \
  -D mapred.map.tasks=1 -D mapred.reduce.tasks=1 \
  -mapper "pax.sh 3" -reducer "cat" \
  -input "/user/***/input/" -output "/user/***/out/"
paxdiablo