I am having a lot of difficulty loading certain directories and processing them.
The idea is to process all unprocessed files. To do that, I store the processing timestamp in HDFS every time I finish processing. That makes it much easier to determine whether files have been processed (by comparing the last processing timestamp with the current timestamp).
Here's my script:
--process latest
register 'hdfs:/udf/myudf.jar';
define toDate tech.main.tics.convertDate();
define startTS tech.main.tics.startTS();
define endTS tech.main.tics.endTS();
raw = LOAD 'hdfs:/home/raw/report/last_process_time/part-r-00000' AS (DATE);
start_ts = FOREACH raw GENERATE startTS(DATE);
end_ts = FOREACH raw GENERATE endTS(ToUnixTime(CurrentTime()));
store start_ts into '/home/raw/report/start-ts';
store end_ts into '/home/raw/report/end-ts';
run -param START=/home/raw/report/start-ts/part-m-00000 -param END=/home/raw/report/end-ts/part-r-00000 hdfs:/home/raw/pig-script/update_test.pig
And here's my update_test.pig:
register 'hdfs:/udf/elephant-bird-pig-4.10.jar';
register 'hdfs:/udf/elephant-bird-core-4.10.jar';
register 'hdfs:/udf/elephant-bird-hadoop-compat-4.10.jar';
register 'hdfs:/udf/json-simple-1.1.1.jar';
register 'hdfs:/udf/myudf.jar';
define toDate tech.main.tics.convertDate();
define toBag tech.main.tics.MapToBag();
last_processed = LOAD 'hdfs:/home/raw/report/last_process_time/part-r-00000' AS (DATE);
previous1 = LOAD 'hdfs:/home/raw/report/events_by_application/part-r-00000';
raw = LOAD '/home/raw/dummy-logs/{$START..$END}/*' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
scene = foreach raw generate
(float)json#'value' AS VALUE,
(long)json#'ts' AS TS,
toDate(json#'ts') AS DATE;
store scene into 'hdfs:/home/raw/report2/total-scene';
--temporarily disabled
--rmf /home/raw/report/
--fs -mv /home/raw/report2/. /home/raw/report
--rmf /home/raw/report2
Pig keeps reading the substituted parameters ($START and $END) as the file paths themselves instead of the contents of those files.
What am I doing wrong?
Thanks.
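To show what I mean, here is a rough Python sketch (not Pig itself) of what seems to be happening. It assumes that -param substitution is a plain text replacement, which matches what I observe. string.Template is only a stand-in for that step; the paths are the ones from my scripts above.

```python
from string import Template

# Stand-in for Pig's parameter substitution, assumed here to be
# plain text replacement of $NAME with the value passed via -param.
load_path = Template("/home/raw/dummy-logs/{$START..$END}/*")

# The values given on the `run -param ...` line: these are file paths,
# not the timestamps stored inside those files.
params = {
    "START": "/home/raw/report/start-ts/part-m-00000",
    "END": "/home/raw/report/end-ts/part-r-00000",
}

# The path string that the LOAD statement ends up with.
expanded = load_path.substitute(params)
print(expanded)
```

So the LOAD path ends up containing the two part-file paths, whereas I expected the timestamps stored inside those files to appear there.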