When developing Pig scripts that use the STORE command I have to delete the output directory for every run or the script stops and offers:

2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists

So I'm searching for an in-Pig solution to automatically remove the directory, also one that doesn't choke if the directory is non-existent at call time.

In the Pig Latin Reference I found the shell command invoker fs. Unfortunately the Pig script breaks whenever anything produces an error. So I can't use

fs -rmr foo/bar

(i.e., remove recursively) since it breaks if the directory doesn't exist. For a moment I thought I could use

fs -test -e foo/bar

which is a test and shouldn't break, or so I thought. However, Pig again interprets test's return code on a non-existent directory as a failure code and breaks.

There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. Anyway, I'm using Pig 0.8.1 out of necessity and there is no such parameter.

valid

2 Answers

At last I found a solution on grokbase. Since finding the solution took too long I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, in order to delete the directory, you can call at the start of the script

rmf foo/bar

No ";" or quotation marks are required, since rmf is a Grunt shell command rather than a Pig Latin statement.

I cannot reproduce it now, but at some point I got an error message (something about missing files), and I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration. Placing it after the SET, REGISTER and %default lines should be fine.

Example:

SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
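
To avoid repeating the path in both the rmf call and the STORE statement, the output location can be held in a parameter. This is a minimal sketch, assuming Pig's parameter substitution is available; the parameter name `out` is a hypothetical choice and not part of the original script.

```pig
-- 'out' is a hypothetical parameter name; override it with: pig -param out=some/dir script.pig
%default out 'foo/bar'

-- Remove the previous output; rmf does not fail if the path is missing.
rmf $out

Rel = LOAD 'something.tsv';
STORE Rel INTO '$out';
```

Since the substitution is purely textual, rmf and STORE always refer to the same directory.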
valid
  • Although this is indeed nice, it's not atomic. I would rather do it in three steps: 1) store in 'foobar-tmp' 2) rmf foo/bar 3) mv 'foobar-tmp' to foo/bar – Miguel Ping Oct 15 '13 at 13:53
  • @MiguelPing: It looks to me like your approach should run into my initial problem but for `foobar-tmp` instead of `foo/bar`. Storing first may also produce that elusive error I tentatively attributed to map/reduce. If your solution works on your side could you turn it into an answer with an example script and provide your pig version number? – valid Oct 15 '13 at 14:50
  • @valid my solution is similar to yours, I just added an extra step to guarantee that if something happens between the `rmf` and the `STORE` (say, exception) you don't lose data. Pig scripts can fail any time, so my solution isn't atomic either, but at least you don't run the risk of losing data. – Miguel Ping Oct 15 '13 at 16:57
  • Thank you so much for this! I was trying to look for a similar function but somehow, I couldn't locate it in the official documentation. – Akshay Gaur Apr 25 '16 at 16:55
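
The three-step variant suggested in the comments could be sketched roughly as follows. This is an untested sketch, not a script posted by either commenter: the temporary path `foo/bar-tmp` is a hypothetical name, the extra rmf on the temporary path is added to address the objection that the temporary directory would hit the same "already exists" error, and it assumes that Pig runs the pending STORE before it reaches the later rmf and mv commands and aborts the script if the STORE fails (which may need to be requested explicitly, depending on Pig version and settings).

```pig
-- 'foo/bar-tmp' is a hypothetical temporary location.
-- Clear leftovers from an earlier failed run.
rmf foo/bar-tmp

Rel = LOAD 'something.tsv';

-- 1) Store into the temporary directory first.
STORE Rel INTO 'foo/bar-tmp';

-- 2) Only now remove the previous output.
rmf foo/bar

-- 3) Move the fresh output into place.
mv foo/bar-tmp foo/bar
```

As noted in the comments, this is still not atomic; it only narrows the window in which the old output is gone and the new output is not yet in place.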

Once you use the fs command, there are a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:

-- Delete file (won't work for output, which will be a directory,
-- but will work for a file that gets copied or moved during
-- the script). The touchz call ensures the file exists before rm runs.
fs -touchz top_100
rm top_100

For a directory

-- Delete dir
fs -rm -r out
Todd Nemet