1

I have a PIG script that

  • Loads and transforms the data from a csv
  • Replaces some characters

Calls a java program (JAR) to convert the date-time in csv from 06/02/2015 18:52 to 2015-6-2 18:52 (mm/DD/yyyy to yyyy-MM-dd)

REGISTER /home/cloudera/DateTime.jar;

A = Load '/user/cloudera/Data.csv' using PigStorage(',') as (ac,datetime,amt,trace);

B = FOREACH A GENERATE ac, REPLACE(datetime, '\\/','-') as newdate,REPLACE(amt,'-','') as newamt,trace;

C = FOREACH B GENERATE ac,Converter.DateTime(newdate) as ConvDate,ConvAmt,trace;

Store C into '/user/cloudera/Output/' using PigStorage('\t');

Sample Input -- 21467245 06/02/2015 18:52 -9.59 518

Sample Output -- 21467245 2015-6-2 18:52 9.59 518

I am loading the output into hive, other fields seem fine during import, but the date-time field results null if loaded as timestamp and is intact when its string.

Where is this going wrong?

Am using Cloudera CDH 5

  • there is a built-in `ToDate()` function in pig. – o-90 Jun 25 '15 at 13:54
  • @GoBrewers14 - Does it change the output format from pig to be compatible for Hive to ingest as timestamp? – Santosh Sulibhavi Jun 25 '15 at 15:34
  • @GoBrewers14 ToDate(newdate, 'MM-dd-yyyy HH:mm') gives 2015-06-02T18:52:00.000Z Hive still doesn't take this format as it needs yyyy-MM-dd as timestamp. – Santosh Sulibhavi Jun 25 '15 at 19:03
  • if you need `yyyy-MM-dd` then why would you put `yyyy-MM-dd HH:mm` in the `ToDate()` function? just put `yyyy-MM-dd` – o-90 Jun 25 '15 at 19:12
  • @GoBrewers14 ToDate(newdate, 'yyyy-MM-dd HH:mm') gives an error, I tried it before putting before comment. Backend error : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.ToDate2ARGS)[datetime] - scope-38 Operator Key: scope-38) children: null at []]: java.lang.IllegalArgumentException: Invalid format: "06-02-2015 18:52" is malformed at "15 18:52" As per my knowledge, Hive can take yyyy-MM-dd HH:mm format if including time, which is what am trying to achieve – Santosh Sulibhavi Jun 25 '15 at 20:13

1 Answers1

0

From the hive docs:

Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.

So you need to either change your Converter to output this format, or use a UDF --- or just keep them as strings, which is what I usually do !

maxymoo
  • 35,286
  • 11
  • 92
  • 119
  • I need to whole bunch of operations when they have been imported to hive such as deciding if it's morning or evening, Number of accounts in a given time period of time and many more. I believe I won't be able to do so if it is in a string format. Can you please help me with this? – Santosh Sulibhavi Jun 26 '15 at 12:38