I am trying out Pig UDFs and have been reading about them. The online material was helpful, but I am still not sure how to create a complex output schema that contains nested bags.

Please help. The requirement is as follows. Say, for example, I am analyzing e-commerce order data. An order can contain multiple products.

I have the product-level data grouped by order, and this is the input to my UDF: each input row is one order together with a bag holding the information about the products in that order.

InputSchema:

(grouped_at_order, {
    (input_column_values_at_product1_level),
    (input_column_values_at_product2_level)
})

I would be computing metrics at both the order level and the product level in the UDF. For example, sum(products) is an order-level metric, while the color of each product is a product-level metric. So, for each order-level group sent to the UDF, I want to compute the order-level and product-level metrics.

Expected OutputSchema:

{
  { (orders, (computed_values_at_order_level)) },
  { (productlevel,
      {
        (computed_values_at_product1_level),
        (computed_values_at_product2_level),
        (computed_values_at_product3_level)
      }
    )
  }
}

The objective is then to persist the order-level and product-level data in two separate output tables from Pig.

Is there a better way of doing this?

user1652054
  • What is the actual problem you are solving? It might be possible to do it without a UDF, or maybe a UDF and a `GROUP BY` – maxymoo Jun 09 '15 at 04:09
  • Yes. I have done a group by to get the data grouped at the order level. I need to send this grouped order data to the UDF and compute order-level and product-level metrics. – user1652054 Jun 09 '15 at 04:24
  • I'm still not sure that a UDF is what you need. sum is already implemented, and color could be retrieved by a join to another table. – maxymoo Jun 09 '15 at 05:10
  • Thanks. That was an example. I am trying to compute some metrics by adding/dividing some columns based on the values of other columns. For example: if col1 == 'A', then res = col3 + col4. But if the condition is too complex, generating it out of the available columns is messy. – user1652054 Jun 09 '15 at 06:54
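
To make the rule in the last comment concrete, here is a minimal plain-Java sketch of the logic such a UDF might run over one grouped order. It deliberately uses no Pig classes: `ProductRow` and `OrderMetrics` are hypothetical names, the column types are assumed to be a string and two doubles, and the fallback for rows where col1 is not 'A' (use col3 alone) is an assumption, since the comment does not say what happens in that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java stand-in for one product row of a grouped order.
// In a real UDF these values would be read from a Pig tuple inside the bag.
class ProductRow {
    final String col1;
    final double col3;
    final double col4;

    ProductRow(String col1, double col3, double col4) {
        this.col1 = col1;
        this.col3 = col3;
        this.col4 = col4;
    }
}

class OrderMetrics {
    // Product-level rule from the comment: if col1 == 'A' then res = col3 + col4.
    // Assumption: any other value of col1 falls back to col3 alone.
    static double productMetric(ProductRow row) {
        if ("A".equals(row.col1)) {
            return row.col3 + row.col4;
        }
        return row.col3;
    }

    // One computed value per product, in input order.
    static List<Double> productLevel(List<ProductRow> rows) {
        List<Double> out = new ArrayList<Double>();
        for (ProductRow row : rows) {
            out.add(productMetric(row));
        }
        return out;
    }

    // A single order-level value: here, the sum of the product-level values.
    static double orderLevel(List<ProductRow> rows) {
        double total = 0.0;
        for (ProductRow row : rows) {
            total += productMetric(row);
        }
        return total;
    }
}
```

Keeping the per-product rule in its own method means both the order-level and the product-level results are derived from the same logic, whichever way they are eventually packed into the UDF's output.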

1 Answer

As @maxymoo said, before returning nested data from a UDF, I would first check whether I really need it.

Anyway, if you do, the solution is not complicated, just painful. You create a schema and add its fields, then wrap it in a schema for the tuple, then wrap that in a schema for the bag, adding further fields or sub-bags at each level, and so on.

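import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

@Override
public Schema outputSchema(Schema input) {

    // Innermost level: the fields of one order-level tuple.
    Schema statsOrderLevel = new Schema();
    statsOrderLevel.add(new FieldSchema("value", DataType.CHARARRAY));

    // Wrap those fields in a tuple.
    // Note: the FieldSchema(String, Schema, byte) constructor declares a checked
    // FrontendException, so the complete method needs a try/catch around these calls.
    Schema statsOrderLevelTuple = new Schema();
    statsOrderLevelTuple.add(new FieldSchema(null, statsOrderLevel, DataType.TUPLE));

    // Wrap the tuple in a bag named "stats".
    Schema statsOrderLevelBag = new Schema();
    statsOrderLevelBag.add(new FieldSchema("stats", statsOrderLevelTuple, DataType.BAG));

    [...]

}
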
glefait
  • Aah! Okay. Thanks! Is there a way to not define the output schema at all, and just construct the output in the exec method and return it? – user1652054 Jun 09 '15 at 07:20
  • There are some benefits to creating the schema. If you really need that structure, just do it. This gives you some extra time to think about refactoring ;) – glefait Jun 09 '15 at 07:31
  • :) sure. will give it a try – user1652054 Jun 09 '15 at 07:34
  • @user1652054 this answer does exactly what you asked for, but in the comments you asked *Is there a way not to define the outputschema at all*. You don't have to build a schema; it is not necessary at all. In fact, you could just specify it yourself when calling the UDF with an `AS (blablah: bag...` clause. However, if you want to call this UDF multiple times, it would be much better to build the schema in the UDF with this approach, so you won't have to specify the schema every time you call the UDF. – Balduz Jun 09 '15 at 08:53
  • Following the suggestion, and after a few tries, I finally decided to go ahead without defining an output schema. Thanks! – user1652054 Jun 09 '15 at 17:36