
I have been stuck on this problem for over twelve hours now. I have a Pig script running on Amazon Web Services; currently, I am just running it in interactive mode. I am trying to compute averages over a large data set of climate readings from weather stations; however, the readings don't include country or state information, so they have to be joined with another table that does.

State Table:

719990 99999 LILLOOET                      CN CA BC WKF   +50683 -121933 +02780
719994 99999 SEDCO 710                     CN CA    CWQJ  +46500 -048500 +00000
720000 99999 BOGUS AMERICAN                US US          -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE            US US LA K02R  +31400 -093283 +01410
720002 99999 HALLOCK(AWS)                  US US MN K03Y  +48783 -096950 +02500
720003 99999 DEER PARK(AWS)                US US WA K07S  +47967 -117433 +06720
720004 99999 MASON                         US US MI K09G  +42567 -084417 +02800
720005 99999 GASTONIA                      US US NC K0A6  +35200 -081150 +02440

Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN   PRCP   SNDP   FRSHTT
010010 99999  20090101    23.3 24    15.6 24  1033.2 24  1032.0 24   13.5  6    9.6 24   17.5  999.9    27.9*   16.7   0.00G 999.9  001000
010010 99999  20090102    27.3 24    20.5 24  1026.1 24  1024.9 24   13.7  5   14.6 24   23.3  999.9    28.9    25.3*  0.00G 999.9  001000
010010 99999  20090103    25.2 24    18.4 24  1028.3 24  1027.1 24   15.5  6    4.2 24    9.7  999.9    26.2*   23.9*  0.00G 999.9  001000
010010 99999  20090104    27.7 24    23.2 24  1019.3 24  1018.1 24    6.7  6    8.6 24   13.6  999.9    29.8    24.8   0.00G 999.9  011000
010010 99999  20090105    19.3 24    13.0 24  1015.5 24  1014.3 24    5.6  6   17.5 24   25.3  999.9    26.2*   10.2*  0.05G 999.9  001000
010010 99999  20090106    12.9 24     2.9 24  1019.6 24  1018.3 24    8.2  6   15.5 24   25.3  999.9    19.0*    8.8   0.02G 999.9  001000
010010 99999  20090107    26.2 23    20.7 23   998.6 23   997.4 23    6.6  6   12.1 22   21.4  999.9    31.5    19.2*  0.00G 999.9  011000
010010 99999  20090108    21.5 24    15.2 24   995.3 24   994.1 24   12.4  5   12.8 24   25.3  999.9    24.6*   19.2*  0.05G 999.9  011000
010010 99999  20090109    27.5 23    24.5 23   982.5 23   981.3 23    7.9  5   20.2 22   33.0  999.9    34.2    20.1*  0.00G 999.9  011000
010010 99999  20090110    22.5 23    16.7 23   977.2 23   976.1 23   11.9  6   15.5 23   35.0  999.9    28.9*   17.2   0.09G 999.9  000000

I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being the US.
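The climate extraction can be sanity-checked outside Pig. Here is a minimal Python sketch using the same pattern as the EXTRACT call (Python's `re` and Java's regex engine agree for this pattern; the sample line is a prefix of a row from the climate table above, which is all the pattern consumes):

```python
import re

# Same pattern as the Pig EXTRACT call, with single backslashes
# (Pig string literals need them doubled).
CLIMATE_RE = re.compile(
    r'^(\d{6})\s+(\d{5})\s+(\d{4})(\d{2})(\d{2})\s+(\d{1,3}\.\d{1})')

# Prefix of a sample row; the pattern only consumes up to the temperature.
line = '010010 99999  20090101    23.3 24'

m = CLIMATE_RE.match(line)
print(m.groups())  # ('010010', '99999', '2009', '01', '01', '23.3')
```

If a line matches, the six capture groups line up with the (station, wban, year, month, day, temp) schema; if it doesn't, EXTRACT produces no fields at all.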

The bags have the following schemas:

CLIMATE_REMOVE_EMPTY: {station: int, wban: int, year: int, month: int, day: int, temp: double}
STATES_FILTER_US: {station: int, wban: int, name: chararray, wmo: chararray, fips: chararray, state: chararray}

I need to perform a join on (station, wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I DUMP the resulting bag, the job reports success; however, it returns 0 results. This is the output:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.3   0.9.2-amzn      hadoop  2013-05-03 00:10:51     2013-05-03 00:12:42         HASH_JOIN,FILTER

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime          MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
job_201305030005_0001   2       1       36      15      25      33      33      33              CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN   HASH_JOIN       hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,

Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"

Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful, which leads me to believe that the join condition is never satisfied. I know the input files have some data that should satisfy the join condition, but it returns absolutely nothing.
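For reference, an inner join only emits rows whose keys appear on both sides; if extraction silently nulls out the key on one side, the join is legitimately empty even though both inputs were read successfully. A small Python analogue of the hash join (the sample rows here are hypothetical, not the real data):

```python
# Build a hash table on the join key from one relation, then probe it
# with the other -- roughly what Pig's HASH_JOIN does.
climate = [(720004, 99999, 2009, 1, 23.3), (720005, 99999, 2009, 1, 30.1)]
states_ok = [(720004, 99999, 'MASON', 'US', 'US', 'MI')]
states_nulled = [(None, None, None, None, None, None)]  # a failed EXTRACT

def hash_join(left, right):
    table = {}
    for row in right:
        table.setdefault(row[0], []).append(row)  # key on station
    return [lrow + rrow for lrow in left for rrow in table.get(lrow[0], [])]

print(len(hash_join(climate, states_ok)))      # 1
print(len(hash_join(climate, states_nulled)))  # 0 -- every probe misses
```

So "Success! ... Total records written: 0" is exactly what a join against a relation full of null keys looks like.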

The only thing that looks suspicious is a warning: Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
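That warning typically means a projected field didn't exist in some tuples. EXTRACT returns no fields whenever its regex fails to match a line, so every downstream field comes back null for those rows. A quick Python check (same regex semantics as Java here, using rows from the state table above) shows the STATES pattern matches single-word station names but not multi-word ones, because `(\S+)` cannot cross a space:

```python
import re

# The STATES pattern from the script, with single backslashes.
STATES_RE = re.compile(
    r'^(\d{6})\s+(\d{5})\s+(\S+)\s+(\w{2})\s+(\w{2})\s+(\w{2})')

one_word  = '720004 99999 MASON                         US US MI K09G'
two_words = '720001 99999 PEASON RIDGE/RANGE            US US LA K02R'

print(STATES_RE.match(one_word).groups())
# ('720004', '99999', 'MASON', 'US', 'US', 'MI')
print(STATES_RE.match(two_words))  # None -- EXTRACT yields no fields
```

Every "PEASON RIDGE/RANGE"-style row therefore contributes one null tuple, which would account for a large ACCESSING_NON_EXISTENT_FIELD count.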

I'm not exactly sure where to go from here. Since the job isn't failing, I don't get any errors or debug output to work from.

I'm not sure if these mean anything, but here are other things that stand out. When I try to ILLUSTRATE STATE_CLIMATE_JOIN, I get a NullPointerException: ERROR 2997: Encountered IOException. Exception : null

When I try to ILLUSTRATE STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

Here is my full code:

--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

--Load Climate and State Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader AS (line:chararray);
RAW_STATES = LOAD 'hiddenbucket' USING TextLoader AS (line:chararray);

CLIMATE =
  FOREACH RAW_CLIMATE
  GENERATE
    FLATTEN((tuple(int,int,int,int,int,double))
      EXTRACT(line, '^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})'))
    AS (
      station: int,
      wban: int,
      year: int,
      month: int,
      day: int,
      temp: double
    );

STATES =
  FOREACH RAW_STATES
  GENERATE
    FLATTEN((tuple(int,int,chararray,chararray,chararray,chararray))
      EXTRACT(line, '^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})'))
    AS (
      station: int,
      wban: int,
      name: chararray,
      wmo: chararray,
      fips: chararray,
      state: chararray
    );

CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);

Thanks in advance. I am at a loss here.

--EDIT-- I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
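For anyone hitting the same wall: the post doesn't show the final pattern, but one way to fix it (an assumption about what the working version looked like) is to make the name capture lazy so it can span spaces, i.e. `(.+?)` instead of `(\S+)`. Checked in Python, whose `re` behaves the same as Java's regex engine for this pattern:

```python
import re

# Lazy (.+?) lets the station name contain spaces; the trailing
# two-letter groups and whitespace stop it from over-matching.
STATES_RE_FIXED = re.compile(
    r'^(\d{6})\s+(\d{5})\s+(.+?)\s+(\w{2})\s+(\w{2})\s+(\w{2})\s')

line = '720001 99999 PEASON RIDGE/RANGE            US US LA K02R  +31400'
print(STATES_RE_FIXED.match(line).groups())
# ('720001', '99999', 'PEASON RIDGE/RANGE', 'US', 'US', 'LA')
```

In a Pig string literal the backslashes are doubled, i.e. '^(\\d{6})\\s+(\\d{5})\\s+(.+?)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})\\s'.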

  • Did you execute this in grunt mode and check with DUMP after each transformation? Is each transformation as per your expectation in grunt mode? – Rags May 03 '13 at 05:33
  • Can you check the function name EXTRACT()? I cannot find a similar function in my piggybank.jar; I checked versions 0.10.0 and 0.11.0. – Rags May 03 '13 at 06:00
  • I've performed dumps on both STATES_FILTER_US and CLIMATE_REMOVE_NULL. They both seem to give me what I am expecting. As for an update, if I perform the JOIN before I filter out the STATES, as in: – user2345171 May 03 '13 at 06:23
  • If I perform the JOIN before I filter the data by state, it produces results, but it has several rows I don't need and runs slowly as a result. If I filter the STATES by state IS NOT NULL and then join, it also produces results. This is closer to what I want, but it still contains some values that aren't from the US, since some other rows contain state information. Why am I able to use FILTER BY state IS NOT NULL and have a successful join, but not FILTER BY fips == 'US'? – user2345171 May 03 '13 at 06:33
  • I believe EXTRACT() is valid. The extraction process works and I'm able to dump good data. – user2345171 May 03 '13 at 06:34
  • I am getting an ERROR when trying EXTRACT(): ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.pig.piggybank.evaluation.string.EXTRACT using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Any idea? – Rags May 03 '13 at 08:16
  • Please share your data files and piggybank.jar. I will take a look. – Rags May 03 '13 at 10:32
