
I've seen this asked, but I haven't seen an answer yet.

Is it possible to call a UDF from Pig just a single time?

I basically want the transformation of a text file I store to be handled by a single call to a Java UDF. The internals of the transformation are easier to handle in Java, and the overhead is small, so I'd rather not convert the logic to Pig.

The only way I've successfully called a UDF is as part of a FOREACH statement over some dataset. I thought I could just create a dummy tuple of size one and use it as part of the FOREACH, but I can't figure out the syntax to create this dummy tuple either.

The UDF does not need to return anything; it will handle the FS logic itself. I just want to be able to exec it from within the Pig script, as it makes more sense to instrument it here than as part of the greater workflow.

Any help would be greatly appreciated! Thanks!

  • How large is the file? It sounds like you shouldn't use Pig. Have you considered using the Hadoop API directly through the FileSystem class? – matterhayes Feb 26 '14 at 02:03
  • The file isn't particularly large; the issue I was having is that it is a two-pass process, where the second pass has to use state (in the form of a map) from the first. The two pieces logically go together, and Pig is certainly necessary for all the other processing, before and after. I was hoping not to need to separate the Pig script, and I was surprised that, from the documentation I've seen so far, you can only execute a UDF as part of a FOREACH statement. – Ian Barefoot Feb 26 '14 at 14:15
  • If you really want to use Pig, you could always perform a GROUP ALL and pass the entire bag of data into your UDF. In that case you essentially have just one record, and the UDF can process all of it at once. – matterhayes Feb 26 '14 at 18:56

2 Answers


Disclaimer: it is not really recommended to use Pig for tasks like this. Why bother with MapReduce if the processing fits on one CPU and in RAM?

It can be done, though.

I had a similar problem and used a custom StoreFunc implementation.

Pig will validate the store location and the OutputFormat, so you can simply extend an existing storage class such as PigStorage:

import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class AdHocProcessing extends PigStorage {

    @Override
    public void putNext(Tuple tuple) throws IOException {
        // Called once per input tuple: process each record here.
    }

    @Override
    public void cleanupOnSuccess(String location, Job job) throws IOException {
        // Called after a successful run: close your file, DB connection, etc.
    }
}

Then in Pig it would look like:

input = LOAD 'some.txt';

STORE input INTO './somewhere/' USING AdHocProcessing();

You might also like to add rmf ./somewhere before the STORE (as was suggested here).
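
Putting it together, a minimal sketch of the full script could look like the following (the jar name is only an assumption; adjust it, and any package prefix on the class, to your own packaging):

-- jar containing the AdHocProcessing class (name assumed)
REGISTER adhoc-processing.jar;
input = LOAD 'some.txt';
-- clear the output directory so the STORE does not fail on reruns
rmf ./somewhere
STORE input INTO './somewhere/' USING AdHocProcessing();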

Nikolay Vasiliev

You can also do the following:

input_table = LOAD ...;
input_table_all = GROUP input_table ALL;
-- 'input_table_all' now contains just a single record
output_table = FOREACH input_table_all GENERATE MyUdf(*);

Inside the UDF you will receive a Tuple whose first field is the group key ('all') and whose second field is a bag containing the entire input table; you can then process the whole table inside your UDF.
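
For illustration, a minimal sketch of what such a UDF could look like (the class name MyUdf, the String return type, and the row counting are assumptions; your real processing logic goes inside the loop):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUdf extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        // After GROUP ... ALL, field 0 is the group key ('all') and
        // field 1 is a bag holding every row of the original relation.
        DataBag rows = (DataBag) input.get(1);
        long count = 0;
        for (Tuple row : rows) {
            // process the whole table here (e.g. your two-pass logic)
            count++;
        }
        return "processed " + count + " rows";  // placeholder result
    }
}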

ofekp