I have a file which contains entries like this:

1,1,07 2012,07 2013,11,blablabla

The first two fields are ids. The third is the begin date (month year) and the fourth is the end date. The fifth field is the number of months between these two dates. The last field contains text.

Here is my pig code to load this data:

f = LOAD 'file.txt' USING PigStorage(',') AS (id1:int, id2:int, date1:chararray, date2:chararray, duration:int, text:chararray);

I would like to filter my file so that I keep only the entries where date2 is less than three years from today. Is it possible to do that in Pig?

Thanks.

user7337271
shanks_roux
  • You can write a filter function. [Here](http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html) is an introduction to **Writing Filter Functions** (search for **Writing Filter Functions** on that page). – zsxwing Jun 19 '13 at 07:48
  • Thanks, I'll watch this. – shanks_roux Jun 19 '13 at 08:27

3 Answers

No need to write a custom function:

In Pig 0.11 you can convert the date2 field from chararray to the datetime data type with the ToDate() function, compute the difference between CurrentTime() and date2 with YearsBetween(), and filter on the result. For example:

-- date2 looks like '07 2013', so the pattern is month, year, then the appended day
g = FILTER f BY YearsBetween(CurrentTime(), ToDate(CONCAT(date2, ' 01'), 'MM yyyy dd')) < 3;
Nishu Tayal
SNeumann
  • That's interesting. I'm using Pig 0.4 but I'll remember your solution. Thank you. – shanks_roux Jun 20 '13 at 07:57
  • If you can't use Pig 0.11's datetime data type, you might still be able to use PiggyBank's datetime UDFs, which convert datetime chararray fields to ISO dates that other PiggyBank UDFs can compare. – SNeumann Jun 26 '13 at 18:57
  • @SNeumann Is there a way to take a TOP on the DateTime field in Pig? – Navneet Dec 12 '13 at 22:50
  • @Navneet - what do you mean? You can use Pig's LIMIT to get top N results (usually used after sorting the dataset). DateTime data type is sortable so it's not a problem to sort by it. – SNeumann Dec 14 '13 at 06:55
  • Yup. That's what I meant. Sorry about not phrasing it clearly. Thanks! – Navneet Dec 14 '13 at 10:27

In Pig 0.11, is there support for comparing datetime types? For example, with a field date1:datetime

and a filter with the condition: date1 >= ToDate('1999-01-01')

does this comparison return the correct result?
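
One way to sidestep the question, sketched here under assumptions: convert both sides to seconds since the epoch with ToUnixTime() and compare the resulting longs. This assumes Pig 0.11 or later; the file name 'events.txt', its layout, and the relation names are hypothetical and only serve the illustration.

```pig
-- Sketch only: assumes Pig 0.11+ and a hypothetical file 'events.txt' with ISO dates.
raw = LOAD 'events.txt' USING PigStorage(',') AS (id:int, date1_str:chararray);

-- Parse the chararray into a datetime using an explicit pattern.
typed = FOREACH raw GENERATE id, ToDate(date1_str, 'yyyy-MM-dd') AS date1;

-- Compare seconds since the epoch, which are plain longs.
recent = FILTER typed BY ToUnixTime(date1) >= ToUnixTime(ToDate('1999-01-01', 'yyyy-MM-dd'));
```

Both dates are parsed with the same pattern and therefore the same default time zone, so the numeric comparison orders them the same way the calendar does.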

krcun

If you are stuck on a Pig version older than 0.11, use the datetime UDFs from PiggyBank. It has a UnixToISO function, for example:

DEFINE UnixToISO   org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
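
For the 'MM yyyy' dates in the question, a sketch along the same lines might look like the following. It assumes that your PiggyBank build ships the CustomFormatToISO and ISOYearsBetween UDFs and that Joda-Time is available; the jar paths and the NOW parameter are placeholders, since older Pig versions have no built-in for the current time.

```pig
-- Sketch only: the jar paths and the $NOW parameter are assumptions, not from the thread.
REGISTER /path/to/piggybank.jar;
REGISTER /path/to/joda-time.jar;

DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOYearsBetween org.apache.pig.piggybank.evaluation.datetime.diff.ISOYearsBetween();

f = LOAD 'file.txt' USING PigStorage(',') AS (id1:int, id2:int, date1:chararray, date2:chararray, duration:int, text:chararray);

-- Turn a value such as '07 2013' into an ISO 8601 string.
iso = FOREACH f GENERATE id1, id2, date1, date2, duration, text, CustomFormatToISO(date2, 'MM yyyy') AS date2_iso;

-- $NOW is supplied at launch; the later date goes first.
g = FILTER iso BY ISOYearsBetween('$NOW', date2_iso) < 3L;
```

The script would be started with something like `pig -param NOW=2013-06-19T00:00:00.000Z script.pig`, passing the current date from the shell.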
Ajeet Ganga