0

Can someone help me how to calculate the sum of a coloumn until it reaches a certain value. Usecase: top product which produced 50% of the revenue.

Is there any library like piggybank to get it done, I couldn't find it in piggybank.

I am trying to implement UDF but I am worried is that the only way :(.

Here is the data structure looks like-

productId, totalProfitByProduct, totalProfitByCompany, totalRevenueOfCompany.

Data is in descending order on totalProfitByProduct. totalProfitByCompany, totalRevenueOfCompany remains same for every row.

Now I want to apply sum over totalProfitByProduct for each product above from the top and get the top products which generated greater than 50% of totalProfitByCompany or totalRevenueOfCompany

Community
  • 1
  • 1
user1954395
  • 112
  • 8
  • 1
    Usually you do stuff like this by calculating the sum, then order by sum desc, then limit to get top 10 or whatever. Maybe if you add some sample input data with a desired output to the question people could post some code to help you. – LiMuBei Apr 28 '15 at 10:07
  • Hey man,that's limiting the rows. But I want limiting on the sum. – user1954395 Apr 29 '15 at 12:41
  • Could you post some sample data and what you have tried? – mr2ert Apr 29 '15 at 16:36
  • here is the data structure looks like- productId, totalProfitByProduct, totalProfitByCompany, totalRevenueOfCompany. Data is in descending order on totalProfitByProduct. totalProfitByCompany, totalRevenueOfCompany remains same for every row. Now I want to apply sum over totalProfitByProduct for each product above from the top and get the top products which generated 50% of totalProfitByCompany or totalRevenueOfCompany @mr2ert – user1954395 Apr 30 '15 at 05:21

1 Answers1

0

piggybank has percentile UDf , which can be used for your requirement .

Pig Script along with the udf can help you achieve it .

KrazyGautam
  • 2,839
  • 2
  • 21
  • 31