
I have a Java application that, when it errors out, writes an error stack like the one below for each error.

<Errors>
    <Error ErrorCode="Code" ErrorDescription="Description" ErrorInfo="" ErrorId="ID">
        <Attribute Name="ErrorCode" Value="Code"/>
        <Attribute Name="ErrorDescription" Value="Description"/>
        <Attribute Name="Key" Value="Key"/>
        <Attribute Name="Number" Value="Number"/>
        <Attribute Name="ErrorId" Value="ID"/>
        <Attribute Name="UserId" Value="User"/>
        <Attribute Name="ProgId" Value="Prog"/>
        <Stack>typical Java stack</Stack>
    </Error>
    <Error>
      Similar info to the above
    </Error>
</Errors>

I wrote a Java log parser to go through the log files and gather information about these errors, and while it does work, it is slow and inefficient, especially for log files hundreds of megabytes in size. Basically, I just use string manipulation to detect where the start/end tags are and tally them up.

Is there a way (via Unix grep, Python, or Java) to efficiently extract the errors and get a count of the number of times each one happens? The entire log file is not XML, so I cannot use an XML parser or XPath. Another problem I am facing is that sometimes the end of an error might roll over into another file, so the current file might not contain the entire stack shown above.

EDIT 1:

Here is what I currently have (relevant portions only to save space).

//Parse files
for (File f : allFiles) {
   System.out.println("Parsing: " + f.getAbsolutePath());
   BufferedReader br = new BufferedReader(new FileReader(f));
   String line = "";
   String fullErrorStack = "";
   while ((line = br.readLine()) != null) {     
      if (line.contains("<Errors>")) {
         fullErrorStack = line;
         while (!line.contains("</Errors>")) {
            line = br.readLine();
            if (line == null) {
               //End of file but end of error stack is in another file.
               fullErrorStack = fullErrorStack + "</Stack></Error></Errors> ";
               break;
            }
            fullErrorStack = fullErrorStack + line.trim() + " ";
         }
         String errorCode = fullErrorStack.substring(fullErrorStack.indexOf("ErrorCode=\"") + "ErrorCode=\"".length(), fullErrorStack.indexOf("\" ", fullErrorStack.indexOf("ErrorCode=\"")));
         String errorDescription = fullErrorStack.substring(fullErrorStack.indexOf("ErrorDescription=\"") + "ErrorDescription=\"".length(), fullErrorStack.indexOf("\" ", fullErrorStack.indexOf("ErrorDescription=\"")));
         String errorStack = fullErrorStack.substring(fullErrorStack.indexOf("<Stack>") + "<Stack>".length(), fullErrorStack.indexOf("</Stack>", fullErrorStack.indexOf("<Stack>")));
         apiErrors.add(f.getAbsolutePath() + splitter + errorCode + ": " + errorDescription + splitter + errorStack.trim());
         fullErrorStack = "";
      }
   }
   br.close();
}


Set<String> uniqueApiErrors = new HashSet<String>(apiErrors);
for (String uniqueApiError : uniqueApiErrors) {
    apiErrorsUnique.add(uniqueApiError + splitter + Collections.frequency(apiErrors, uniqueApiError));
}
Collections.sort(apiErrorsUnique);
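
A note for context: the two hot spots above are the nested indexOf scans and the Collections.frequency loop, which rescans the whole list once per unique error. For comparison, a precompiled regex plus a single-pass HashMap tally avoids both. This is only an untested sketch: the pattern is assumed from the log sample above, and apiErrors, apiErrorsUnique and splitter are as in the code shown.

//Untested sketch, not the current code: extract the fields with a
//precompiled regex (pattern assumed from the log sample) instead of
//nested indexOf calls. Needs java.util.regex.{Pattern,Matcher}.
Pattern errorPattern = Pattern.compile(
      "<Error ErrorCode=\"([^\"]*)\" ErrorDescription=\"([^\"]*)\"");
Matcher m = errorPattern.matcher(fullErrorStack);
if (m.find()) {
   apiErrors.add(f.getAbsolutePath() + splitter + m.group(1) + ": " + m.group(2));
}

//Untested sketch: tally in one pass with a HashMap instead of calling
//Collections.frequency once per unique error. Needs java.util.{Map,HashMap}.
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String apiError : apiErrors) {
   Integer c = counts.get(apiError);
   counts.put(apiError, c == null ? 1 : c + 1);
}
for (Map.Entry<String, Integer> e : counts.entrySet()) {
   apiErrorsUnique.add(e.getKey() + splitter + e.getValue());
}
Collections.sort(apiErrorsUnique);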

EDIT 2:

Sorry for forgetting to mention the desired output. Something like the below would be ideal.

Count, ErrorCode, ErrorDescription, List of files it occurs in (if possible)

Matt

3 Answers


Well, it's not technically grep, but if you're open to using other standard UNIX-esque commands, here's a one-liner that could do the job, and it should be fast (I'd be interested to see results on your dataset, actually):

sed -r -e '/Errors/,/<\/Errors>/!d' *.log -ne 's/.*<Error\s+ErrorCode="([^"]*)"\s+ErrorDescription="([^"]*)".*$/\1: \2/p' | sort | uniq -c | sort -nr

Assuming they're in date order, the *.log glob will also solve the problem of logs rolling (adjust to match your log naming, of course).

Sample output

From my (dubious) test data based on yours:

 10 SomeOtherCode: This extended description
  4 Code: Description
  3 ReallyBadCode: Disaster Description

Brief Explanation

  1. Use sed to print only between selected addresses (lines, here)
  2. Use sed again to filter these with a regex, replacing each header line with a composed, unique-enough error string (including the description), similar to your Java (or at least what we can see of it)
  3. Sort and count these unique strings
  4. Present in descending order of frequency
declension
  • Thank you for the reply. I have a few questions/comments: 1.) To make it recursively search subfolders I added this to the front "find . -type f -print0 | xargs -0" - is that the best way? 2.) Would it be possible to expand on this to add a list of each file the error occurs in? 3.) A lower priority but the above won't run on AIX due to the -r flag. I can copy the log files to RedHat Linux but running directly from AIX would be a little easier. – Matt Feb 12 '15 at 12:54
  • @Matt: 1) Yes, that should work well, and allow for spaces 2) Not this way, without making it a *lot* more complex anyway (probably better then to do it in Python / Perl / Ruby / Java / Groovy etc). 3) No problem, just remove the `-r` and backslash the `+` and parentheses: `sed -e '/Errors/,/<\/Errors>/!d' *.log -ne 's/.*…` – declension Feb 14 '15 at 18:31

Given your updated question:

$ cat tst.awk
BEGIN{ OFS="," }
match($0,/\s*<Error ErrorCode="([^"]+)" ErrorDescription="([^"]+)".*/,a) {
    code = a[1]
    desc[code] = a[2]
    count[code]++
    files[code][FILENAME]
}
END {
    print "Count", "ErrorCode", "ErrorDescription", "List of files it occurs in"
    for (code in desc) {
        fnames = ""
        for (fname in files[code]) {
            fnames = (fnames ? fnames " " : "") fname
        }
        print count[code], code, desc[code], fnames
    }
}
$
$ awk -f tst.awk file
Count,ErrorCode,ErrorDescription,List of files it occurs in
1,Code,Description,file

It still requires gawk 4.* for the 3rd arg to match() and 2D arrays, but again, that's easily worked around in any awk.

Per the request in the comments, here's a non-gawk version:

$ cat tst.awk
BEGIN{ OFS="," }
/[[:space:]]*<Error / {
    split("",n2v)
    while ( match($0,/[^[:space:]]+="[^"]+/) ) {
        name = value = substr($0,RSTART,RLENGTH)
        sub(/=.*/,"",name)
        sub(/^[^=]+="/,"",value)
        $0 = substr($0,RSTART+RLENGTH)
        n2v[name] = value
    }
    code = n2v["ErrorCode"]
    desc[code] = n2v["ErrorDescription"]
    count[code]++
    if (!seen[code,FILENAME]++) {
        fnames[code] = (code in fnames ? fnames[code] " " : "") FILENAME
    }
}
END {
    print "Count", "ErrorCode", "ErrorDescription", "List of files it occurs in"
    for (code in desc) {
        print count[code], code, desc[code], fnames[code]
    }
}
$
$ awk -f tst.awk file
Count,ErrorCode,ErrorDescription,List of files it occurs in
1,Code,Description,file

There are various ways the above could be done, some briefer, but when input contains name=value pairs I like to create a name-to-value array (n2v[] is the name I usually give it) so I can access the values by their names. That makes the code easy to understand and to modify in the future to add fields, etc.
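
For anyone folding this back into the original Java parser, the same name-to-value idea carries over directly. A minimal untested sketch (the attribute regex is an assumption based on the log sample in the question, and AttributeMap is just an illustrative name):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//Untested sketch: build a name-to-value map from one tag line, mirroring
//the n2v[] idea above. The attribute regex is assumed from the log sample.
class AttributeMap {
    private static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static Map<String, String> parse(String line) {
        Map<String, String> n2v = new HashMap<String, String>();
        Matcher m = ATTR.matcher(line);
        while (m.find()) {
            n2v.put(m.group(1), m.group(2));
        }
        return n2v;
    }
}

For example, calling parse() on the <Error ...> header line from the sample gives a map where get("ErrorCode") returns "Code" and get("ErrorDescription") returns "Description".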


Here's my previous answer, as there are some things in it you'll find useful in other situations:

You don't say what you want the output to look like, and your posted sample input isn't really adequate to test against and show useful output, but this GNU awk script shows the way to get a count of whatever attribute name/value pairs you like:

$ cat tst.awk         
match($0,/\s*<Attribute Name="([^"]+)" Value="([^"]+)".*/,a) { count[a[1]][a[2]]++ }
END {
    print "\nIf you just want to see the count of all error codes:"
    name = "ErrorCode"
    for (value in count[name]) {
        print name, value, count[name][value]
    }

    print "\nOr if theres a few specific attributes you care about:"
    split("ErrorId ErrorCode",names,/ /)
    for (i=1; i in names; i++) {
        name = names[i]
        for (value in count[name]) {
            print name, value, count[name][value]
        }
    }

    print "\nOr if you want to see the count of all values for all attributes:"
    for (name in count) {
        for (value in count[name]) {
            print name, value, count[name][value]
        }
    }
}


$ gawk -f tst.awk file

If you just want to see the count of all error codes:
ErrorCode Code 1

Or if there's a few specific attributes you care about:
ErrorId ID 1
ErrorCode Code 1

Or if you want to see the count of all values for all attributes:
ErrorId ID 1
ErrorDescription Description 1
ErrorCode Code 1
Number Number 1
ProgId Prog 1
UserId User 1
Key Key 1

If you have data spread across multiple files, the above couldn't care less; just list them all on the command line:

gawk -f tst.awk file1 file2 file3 ...

It uses GNU awk 4.* for true multi-dimensional arrays, but there are trivial workarounds for any other awk if needed.

One way to run an awk command on files found recursively under a directory:

awk -f tst.awk $(find dir -type f -print)
Ed Morton
  • Thank you for the reply - is it possible to use this script and pass it a folder of log files and have it recursively tally up all the errors? – Matt Feb 12 '15 at 18:26
  • Do you literally mean recursively, i.e. descending directories looking for files? If the files are all at one level then just call it with dir/* instead of one file name. – Ed Morton Feb 13 '15 at 04:11
  • Yes - I mean recursively. There will be log files in sub folders of other folders and I would like to parse them all. – Matt Feb 13 '15 at 13:09
  • Got it. `awk` is the UNIX command to manipulate text. The UNIX command to [recursively] find files is `find`. The UNIX shell is an environment from which to call tools with a language to sequence those calls. So - you need to write a brief shell script which calls `find` to find the files and pass them to `awk` to manipulate the text inside those files. I updated my answer to show one way to do that - it'll work as long as you don't have a huge number of files and they don't contain spaces in their names but if either of those conditions is true you'll need a bit more complicated shell script. – Ed Morton Feb 13 '15 at 13:39
  • Thank you again for the reply. I've been studying your response and realized that I am using Awk 3.1.7. I am not the administrator on this machine so I cannot upgrade it. You mention a trivial workaround for older versions of awk, can you please share it? – Matt Feb 16 '15 at 13:22
  • I just edited my answer to include a non-gawk-specific version. – Ed Morton Feb 16 '15 at 15:31
  • Thank you again. All of the answers here are excellent and give me a lot to study but this one answers everything and works as expected, hence marking it as the solution. – Matt Feb 16 '15 at 19:41

I assume that since you mention Unix grep, you likely have perl available too. Here's a simple perl solution:

#!/usr/bin/perl

my %countForErrorCode;
while (<>) { /<Error ErrorCode="([^"]*)"/ && $countForErrorCode{$1}++ }
foreach my $e (keys %countForErrorCode) { print "$countForErrorCode{$e} $e\n" }

Assuming you are running *nix, save this perl script, make it executable, and run it with a command like...

$ ./grepError.pl *.log

you should get output like...

8 Code1
203 Code2
...

where 'Code1' etc. are the error codes captured between the double quotes in the regex.

I worked this up on Windows with Cygwin. This solution assumes:

  1. Location of your perl is /usr/bin/perl. You can verify with $ which perl
  2. The regex above, /<Error ErrorCode="([^"]*)"/, is how you want to count.

The code is doing...

  1. my %countForErrorCode declares a map (hash).
  2. while (<>) iterates over each line of input and assigns the current line to the built-in variable $_.
  3. /<Error ErrorCode="([^"]*)"/ implicitly tries matching against $_.
  4. When a match occurs, the parentheses capture the value between the double quotes and assign the captured string to $1.
  5. The regex "returns true" on a match, and only then does the count get incremented by && $countForErrorCode{$1}++.
  6. For output, iterate the captured error codes with foreach my $e (keys %countForErrorCode) and print the count and code on a line with print "$countForErrorCode{$e} $e\n".
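
If you'd rather stay in Java, the same idiom translates almost line for line. A minimal untested sketch (the class name GrepError and the regex are illustrative only), taking log file names as arguments the way perl's <> does:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//Untested sketch: the perl counting idiom above, in Java. Reads the files
//named on the command line and prints "count code" lines.
public class GrepError {
    public static void main(String[] args) throws Exception {
        Pattern p = Pattern.compile("<Error ErrorCode=\"([^\"]*)\"");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String fileName : args) {
            BufferedReader br = new BufferedReader(new FileReader(fileName));
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (m.find()) {
                    Integer c = counts.get(m.group(1));
                    counts.put(m.group(1), c == null ? 1 : c + 1);
                }
            }
            br.close();
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}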

Edit: more detailed output per updated spec

#!/usr/bin/perl

my %dataForError;

while (<>) {
  if (/<Error ErrorCode="([^"]+)"\s*ErrorDescription="([^"]+)"/) {
    if (! $dataForError{$1}) {
      $dataForError{$1} = {}; 
      $dataForError{$1}{'desc'} = $2;
      $dataForError{$1}{'files'} = {};
    }
    $dataForError{$1}{'count'}++;
    $dataForError{$1}{'files'}{$ARGV}++;
  }
}
my @out;
foreach my $e (keys %dataForError) {
  my $files = join("\n\t", keys %{$dataForError{$e}{'files'}});
  my $out = "$dataForError{$e}{'count'}, $e, '$dataForError{$e}{'desc'}'\n\t$files\n";
  push @out, $out;
}
print @out;

And, as you posted above, to pick up input files recursively you can run this script like:

$ find . -name "*.log" | xargs ./grepError.pl

And produce output like:

8, Code2, 'bang'  
    ./today.log  
48, Code4, 'oops'  
    ./2015/jan/yesterday.log  
2, Code1, 'foobar'  
    ./2014/dec/someday.log

Explanation:

  1. The script maps each unique error code to a hash that tracks the count, description and unique filenames where the error code is found.
  2. Perl auto-magically stores the current input filename into $ARGV.
  3. The script counts each unique filename occurrence, but does not output those counts.
Shane Voisard
  • Thank you for the reply - but I am having a problem running your script. It errors out with "syntax error at test.pl line 20, near "push"" with line 20 being "push @out, $out;". Any ideas? – Matt Feb 13 '15 at 13:08
  • Try removing the single quotes surrounding `$dataForError{$e}{'desc'}`. They are optional and only there for output formatting. I am running perl 5.14.4 on cygwin. – Shane Voisard Feb 13 '15 at 14:57
  • Just noticed that the line before the push() was not semicolon terminated. I edited the answer and added the semicolon. – Shane Voisard Feb 13 '15 at 15:06
  • Thank you again for the reply. I'm still however getting an error when running. This is Red Hat Linux 6.5 and Perl 5.10: "Type of arg 1 to keys must be hash (not hash element) at /home/orionoms/bin/test.pl line 18, near "})"" with line 18 being "my $files = join("\n\t", keys $dataForError{$e}{'files'});". I am calling this like ">find . -name "*.log" | xargs ~/bin/test.pl" to recursively get all log files in a folder and not all folders might have matching logs - not sure if that is the problem. – Matt Feb 14 '15 at 21:22
  • Seems that in perl 5.10, I need to surround `$dataForError{$e}{'files'}` in line 18 with `%{}` and the script works. I inferred this change from http://stackoverflow.com/questions/20824920/perl-array-references-and-avoiding-type-of-arg-1-to-keys-must-be-hash-error. – Shane Voisard Feb 16 '15 at 21:22