
I have the log below and am trying to parse it by the indicated column numbers: 1 as Date, 2 as Time, 3 as Task, 4 as Error_Line, and 5 (all the remaining columns) as Error_Message.

|1     | |2     | |3   |     |4  | |5                                                                          |
09-15-16 05:23:45 B:VVBN     09064 Port 22 Device 10400 Remote 44 13331 Link Up RP2016
09-15-16 05:23:44 A:QAWE     09064 Port 22 Device 10400 Remote 44 13331 Link Up RP2016
09-15-16 05:23:44 B:VVBN     13425 Port 22 Device 10400 Remote 44 13331 Receive Time Error: 24666 23270 1396 69
09-15-16 05:23:43 B:QAWE     13372 Port 22 Device 10400 Remote 44 13331 Send Time Error: 444 1888 1444 69
09-15-16 05:23:43 A:VVBN     13425 Port 22 Device 10400 Remote 44 13331 Receive Time Error: 24666 23270 1396 69
09-15-16 05:23:43 A:CCBE     13372 Port 22 Device 10400 Remote 44 13331 Send Time Error: 444 1888 1444 69
09-15-16 05:21:56 B:VVBN     07270 Port 22 Device 10400 Remote 44 13331 AT Timer Expired
09-15-16 05:21:56 A:CCBE     07270 Port 22 Device 10400 Remote 44 13331 AT Timer Expired

Here is my script:

logs = LOAD '/data/test_log.txt' USING PigStorage(' ') AS (date: chararray, time: chararray, task: chararray, line_error: int, error_message: chararray);
date = GROUP logs BY date;

counts = FOREACH date GENERATE COUNT($4) as count;

DUMP counts;

Notice that there is only one space between columns, except between columns 3 and 4, where there are five spaces. I tried the script above, but it only works for the date, not for the last column, Error_Message. I am trying to get this output bag:

(09-15-16,05:23:45,B:VVBN,09064,Port 22 Device 10400 Remote 44 13331 Link Up RP2016)
(09-15-16,05:23:44,A:QAWE,09064,Port 22 Device 10400 Remote 44 13331 Link Up RP2016)
:
:

I only need the first four columns as separate fields; all the other columns in the log file should be merged into a single column 5.

Any suggestions for getting the desired output?

Alsphere
As Pig can “eat anything”, try loading each record as a single line and then `GENERATE` the fields with `REGEX_EXTRACT`, as per https://community.cloudera.com/t5/Support-Questions/extracting-substring-in-PIG-latin/td-p/232673 Regular expressions are a fundamental approach to dealing with text files, so they are worth practicing; there is plenty of online material, including websites for testing them in the browser. – simbo1905 Apr 26 '20 at 18:05

1 Answer


You need to use MyRegExLoader, provided by piggybank, to process custom log files. The regex passed to it should contain one capture group per output field.

  logs = LOAD '/data/test_log.txt' USING org.apache.pig.piggybank.storage.MyRegExLoader ('provide the regex ');
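
The regex itself is left open above. As a rough sketch, one candidate pattern can be checked against the sample lines from the question before it goes into the Pig script. The pattern and the helper name `parse_line` below are my own assumptions, not something from the original answer, and the check is written in Python only because it is quick to run; in a Pig string literal the backslashes would most likely need doubling (`\\S+`).

```python
import re

# Hypothetical pattern (an assumption, not from the answer): the first four
# whitespace-separated fields, then everything that remains as one field.
# \s+ between groups 3 and 4 absorbs the run of five spaces.
LOG_PATTERN = re.compile(r"^(\S+) (\S+) (\S+)\s+(\S+) (.*)$")


def parse_line(line):
    """Return (date, time, task, error_line, error_message), or None if no match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    return match.groups() if match else None


# First sample line from the question.
sample = ("09-15-16 05:23:45 B:VVBN     09064 "
          "Port 22 Device 10400 Remote 44 13331 Link Up RP2016")

print(parse_line(sample))
```

The fourth field is captured as text rather than as an integer, so the leading zero in `09064` is kept, which matches the desired output in the question.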
Arunakiran Nulu