regexp match within a log file, return dynamic content above and below match

Question

I have some catchall log files in a format as follows:

timestamp event summary
foo details
account name: userA
bar more details
timestamp event summary
baz details
account name: userB
qux more details
timestamp etc.

I would like to search the log file for userB, and if found, echo from the preceding timestamp down to (but not including) the following timestamp. There will likely be several events matching my search. It would be nice to echo some sort of --- start --- and --- end --- surrounding each match.

This would be perfect for pcregrep -M, right? Problem is, GnuWin32's pcregrep crashes with multiline regexps searching large files, and these catch-all logs can be 100 megs or more.

What I've tried

My hackish workaround thus far involves using grep -B15 -A30 to find matching lines and print surrounding content, then piping the now more manageable chunk into pcregrep for polishing. Problem is that some events are less than ten lines, while others are 30 or more; and I'm getting some unexpected results where the shorter events are encountered.

:parselog <username> <logfile>

set silent=1
set count=0
set deez=20\d\d-\d\d-\d\d \d\d:\d\d:\d\d
echo Searching %~2 for records containing %~1...

for /f "delims=" %%I in (
    'grep -P -i -B15 -A30 ":\s+\b%~1\b(@mydomain\.ext)?$" "%~2" ^| pcregrep -M -i "^%deez%(.|\n)+?\b%~1\b(@mydomain\.ext|\r?\n)(.|\n)+?\n%deez%" 2^>NUL'
) do (
    echo(%%I| findstr "^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9].[0-9][0-9]:[0-9][0-9]:[0-9][0-9]" >NUL && (
        if defined silent (
            set silent=
            set found=1
            set /a "count+=1"
            echo;
            echo ---------------start of record !count!-------------
        ) else (
            set silent=1
            echo ----------------end of record !count!--------------
            echo;
        )
    )
    if not defined silent echo(%%I
)

goto :EOF

Is there a better way to do this? I've come across an awk command that looked interesting, something like:

awk "/start pattern/,/end pattern/" logfile

... but it would need to match a middle pattern as well. Unfortunately, I'm not that familiar with awk syntax. Any suggestions?

Ed Morton suggested that I supply some example logging and expected output.

Example catch-all

2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730158    Mon Mar 25 08:02:28 2013    529 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 2   Logon Failure:

    Reason:     Unknown user name or bad password

    User Name:  user5f

    Domain:     MYDOMAIN

    Logon Type: 3

    Logon Process:  Advapi  

    Authentication Package: Negotiate

    Workstation Name:   dc3

    Caller User Name:   dc3$

    Caller Domain:  MYDOMAIN

    Caller Logon ID:    (0x0,0x3E7)

    Caller Process ID:  400

    Transited Services: -

    Source Network Address: 169.254.7.86

    Source Port:    40838
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730159    Mon Mar 25 08:02:29 2013    680 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 9   Logon attempt by:   MICROSOFT_AUTHENTICATION_PACKAGE_V1_0

Logon account:  USER6Q

Source Workstation: dc3

Error Code: 0xC0000234
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730160    Mon Mar 25 08:02:29 2013    539 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 2   Logon Failure:

    Reason:     Account locked out

    User Name:  USER6Q@MYDOMAIN.TLD

    Domain: MYDOMAIN

    Logon Type: 3

    Logon Process:  Advapi  

    Authentication Package: Negotiate

    Workstation Name:   dc3

    Caller User Name:   dc3$

    Caller Domain:  MYDOMAIN

    Caller Logon ID:    (0x0,0x3E7)

    Caller Process ID: 400

    Transited Services: -

    Source Network Address: 169.254.7.89

    Source Port:    55314
2013-03-25 08:02:32 Auth.Notice 169.254.5.62    Mar 25 08:36:38 DC4.mydomain.tld MSWinEventLog  5   Security    201326798   Mon Mar 25 08:36:37 2013    4624    Microsoft-Windows-Security-Auditing     N/A Audit Success   DC4.mydomain.tld    12544   An account was successfully logged on.

Subject:
    Security ID:        S-1-0-0
    Account Name:       -
    Account Domain:     -
    Logon ID:       0x0

Logon Type:         3

New Logon:
    Security ID:        S-1-5-21-606747145-1409082233-725345543-160838
    Account Name:       DEPTACCT16$
    Account Domain:     MYDOMAIN
    Logon ID:       0x1158e6012c
    Logon GUID:     {BCC72986-82A0-4EE9-3729-847BA6FA3A98}

Process Information:
    Process ID:     0x0
    Process Name:       -

Network Information:
    Workstation Name:   
    Source Network Address: 169.254.114.62
    Source Port:        42183

Detailed Authentication Information:
    Logon Process:      Kerberos
    Authentication Package: Kerberos
    Transited Services: -
    Package Name (NTLM only):   -
    Key Length:     0

This event is generated when a logon session is created. It is generated on the computer that was accessed.

The subject fields indicate...
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730162    Mon Mar 25 08:02:30 2013    675 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 9   Pre-authentication failed:

    User Name:  USER8Y

    User ID:        %{S-1-5-21-606747145-1409082233-725345543-3904}

    Service Name:   krbtgt/MYDOMAIN

    Pre-Authentication Type:    0x0

    Failure Code:   0x19

    Client Address: 169.254.87.158
2013-03-25 08:02:32 Auth.Critical   etc.

Example command

call :parselog user6q \\path\to\catch-all.log

Expected result

---------------start of record 1-------------
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730159    Mon Mar 25 08:02:29 2013    680 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 9   Logon attempt by:   MICROSOFT_AUTHENTICATION_PACKAGE_V1_0

Logon account:  USER6Q

Source Workstation: dc3

Error Code: 0xC0000234
---------------end of record 1-------------


---------------start of record 2-------------
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730160    Mon Mar 25 08:02:29 2013    539 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 2   Logon Failure:

    Reason:     Account locked out

    User Name:  USER6Q@MYDOMAIN.TLD

    Domain: MYDOMAIN

    Logon Type: 3

    Logon Process:  Advapi  

    Authentication Package: Negotiate

    Workstation Name:   dc3

    Caller User Name:   dc3$

    Caller Domain:  MYDOMAIN

    Caller Logon ID:    (0x0,0x3E7)

    Caller Process ID: 400

    Transited Services: -

    Source Network Address: 169.254.7.89

    Source Port:    55314
---------------end of record 2-------------

Don't ever use `awk "/start pattern/,/end pattern/" logfile`. It makes trivial stuff slightly briefer but you can't expand it to work for non-trivial stuff. If you posted some sample input (I'm assuming you have actual timestamps in your file rather than the word "timestamp") and expected output that would help. There is a simple awk solution. — Ed Morton, Mar 26 '13 at 02:17

Ed Morton · Accepted Answer · 2013-03-26T17:32:49.380

1

This is all you need with GNU awk (for IGNORECASE):

$ cat tst.awk
function prtRecord() {
    if (record ~ regexp) {
        printf "-------- start of record %d --------%s", ++numRecords, ORS
        printf "%s", record
        printf "--------- end of record %d ---------%s%s", numRecords, ORS, ORS
    }
    record = ""
}
BEGIN{ IGNORECASE=1 }
/^[[:digit:]]+-[[:digit:]]+-[[:digit:]]+/ { prtRecord() }
{ record = record $0 ORS }
END { prtRecord() }

or with any awk:

$ cat tst.awk
function prtRecord() {
    if (tolower(record) ~ tolower(regexp)) {
        printf "-------- start of record %d --------%s", ++numRecords, ORS
        printf "%s", record
        printf "--------- end of record %d ---------%s%s", numRecords, ORS, ORS
    }
    record = ""
}
/^[[:digit:]]+-[[:digit:]]+-[[:digit:]]+/ { prtRecord() }
{ record = record $0 ORS }
END { prtRecord() }

Either way you'd run it on UNIX as:

$ awk -v regexp=user6q -f tst.awk file

I don't know the Windows syntax but I expect it's very similar if not identical.

Note the use of tolower() in the script to make both sides of the comparison lower case so the match is case-insensitive. If you can instead pass in a search regexp that's the correct case, then you don't need to call tolower() on either side of the comparison. nbd, it might just speed the script up slightly.

$ awk -v regexp=user6q -f tst.awk file
-------- start of record 1 --------
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security
    11730159    Mon Mar 25 08:02:29 2013    680 Security    NT AUTHORITY\SYSTEM N/A Audit Failure
dc3 9   Logon attempt by:   MICROSOFT_AUTHENTICATION_PACKAGE_V1_0

Logon account:  USER6Q

Source Workstation: dc3

Error Code: 0xC0000234
--------- end of record 1 ---------

-------- start of record 2 --------
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security
    11730160    Mon Mar 25 08:02:29 2013    539 Security    NT AUTHORITY\SYSTEM N/A Audit Failure
dc3 2   Logon Failure:

    Reason:     Account locked out

    User Name:  USER6Q@MYDOMAIN.TLD

    Domain: MYDOMAIN

    Logon Type: 3

    Logon Process:  Advapi

    Authentication Package: Negotiate

    Workstation Name:   dc3

    Caller User Name:   dc3$

    Caller Domain:  MYDOMAIN

    Caller Logon ID:    (0x0,0x3E7)

    Caller Process ID: 400

    Transited Services: -

    Source Network Address: 169.254.7.89

    Source Port:    55314
--------- end of record 2 ---------

edited Mar 26 '13 at 17:32

answered Mar 26 '13 at 02:29

Ed Morton

188,023
17
78
185

This is sort of what I was expecting, but in practice it's very slow. I think I'd like to stick with my idea of using `grep` with context, then trimming the fat from around the middle, as `grep` seems able to find all matches in 100 meg file in just a few seconds. However, `awk` seems to be echoing out everything it received via stdin from `grep`. I wonder whether gnuwin32 `awk` doesn't behave the same as POSIX. The only things I changed in `tst.awk` were adding the timestamp regexp (`/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d/` instead of `/timestamp/`), and added `IGNORECASE = 1` to `prtRecord()` – rojo Mar 26 '13 at 14:16
\d is a shorthand that only some tools will understand. You were probably building up a single record of the contents of the whole file and that's why it was slow (string concatenation is slower than input/output in awk and the bigger the strings the slower it is). Use POSIX character classes like [[:digit:]], see my script. I don't believe that the awk script I posted would be slow. – Ed Morton Mar 26 '13 at 14:18
IGNORECASE is a GNU awk extension, not available in POSIX awks. If you have GNU awk (`awk --version` will tell you) then you should assign IGNORECASE=1 in the BEGIN section rather than in the function and get rid of the tolower()s. I'll update my script to show that alternative. – Ed Morton Mar 26 '13 at 14:24
You're a pro. Look at me.... Pro. You're absolutely correct that, when I don't mess up the syntax, `awk` is fast. Thanks especially for the tip about [POSIX bracket expressions](http://www.regular-expressions.info/posixbrackets.html). But I wonder whether there's any way to simulate `\b` around the search term? For example, we have a user account in our organization named `A`. That's it. Just `A`. With pcre I was able to match `/:\s+\bA\b(@domain.tld)?$/` to ensure that I ended up with exactly the records for which I searched. Can that be reasonably translated to POSIX bracket notation? – rojo Mar 26 '13 at 14:49
\b is a backspace character in awk (I know, it's different in other tools). \B is the awk equivalent and GNU awk would support that but POSIX awks would not. There's also \< and \> and some other alternatives - see http://www.gnu.org/software/gawk/manual/gawk.html#Escape-Sequences. \s is [[:space:]] in POSIX but gawk would recognize \s too. There isn't a POSIX character class equivalent to \B, you'd have to consider using negated classes like [^[:alnum:]_]. If I were you I would use the gawk extended functionality when POSIX has no equivalent. – Ed Morton Mar 26 '13 at 15:01
I'm curious - did you see any difference in speed when using the IGNORECASE version vs the tolower() version of the script? I just wonder if IGNORECASE helps or hurts and by how much. – Ed Morton Mar 26 '13 at 15:03
Testing with a log file containing 2,902,930 lines, the search completes in about ten seconds regardless of whether I have `IGNORECASE` in the `BEGIN` section, in the `prtRecord()` function, I use `tolower()`, or I use none of the above and perform a case-sensitive search. The difference seems to be negligible. For what it's worth, I think I fine tuned my match well enough by adding `regexp=":[[:space:]]+"regexp"[^[:alnum:]]"` to the `BEGIN` section. I'll have to test it some more, but I think I'm on the right track now. – rojo Mar 26 '13 at 15:51
I feel silly for asking, but how can I add a couple of extra blank lines at the end of the `--- end of record N ---` line? I tried adding more `\n`'s at the end of the dashes, `print "\n\n"`, `printf "%s\n", "\n\n"`, setting `ORS = "\n\n\n"` within the `BEGIN` section... I'm as helpless with `printf` as I am with `awk` apparently. – rojo Mar 26 '13 at 17:20
ORS is printed at the end of each `print` command. I never used `print` in my script so setting ORS would have no effect. Adding some `\n`s at the end of the dashes should be all you need to do by I wonder if, this being Windows, you need \r\n or something to produce newlines. To be safe you could print ORSs and those will hopefully be whatever you need to produce newlines in your OS - I updated my script above to show that. – Ed Morton Mar 26 '13 at 17:31
Nope, but `\r\n` worked, sort of. Weird, because I can do new lines without carriage returns in JScript. For instance, `WSH.Echo('hello\nworld\n!')` results in three lines of output without any `\r`'s. Anyway, I say `\r\n` *sort of* worked because adding `\r\n`'s to the `printf` lines increased execution time from about 10 seconds to about a minute and a half. Meh. Screw it. I'll add the extra between-record line breaks with batch scripting. C'est la vie. *shrug* Anyway, thanks for everything! My log reader is finished now. – rojo Mar 26 '13 at 19:11
Hmm, it doesn't make sense that ORS would not produce the desired newlines when used printing the lines, but DOES produce the desired newlines when being appended to each line of the constructed record `{ record = record $0 ORS }`. Maybe the "\r"s are already present which would imply that you should really use `BEGIN{ RS=ORS="\r\n"}`. You could print RS and/or ORS and/or a single $0 to see. Anyway, glad it's working for you. – Ed Morton Mar 26 '13 at 22:21

Magoo · Answer 2 · 2013-03-26T05:19:55.643

Here's my effort:

@ECHO OFF
SETLOCAL
::
:: Target username
::
SET target=%1
CALL :zaplines
SET count=0
FOR /f "delims=" %%I IN (rojoslog.txt) DO (
  ECHO.%%I| findstr /r "^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9].[0-9][0-9]:[0-9][0-9]:[0-9][0-9]" >NUL
  IF NOT ERRORLEVEL 1 (
    IF DEFINED founduser CALL :report
    CALL :zaplines
  )
  (SET stored=)
  FOR /l %%L IN (1000,1,1200) DO IF NOT DEFINED stored IF NOT DEFINED line%%L (
    SET line%%L=%%I
    SET stored=Y
   )
  ECHO.%%I|FINDSTR /b /e /i /c:"account name: %target%" >NUL
  IF NOT ERRORLEVEL 1 (SET founduser=Y)
)
IF DEFINED founduser CALL :report
GOTO :eof

::
:: remove all envvars starting 'line'
:: Set 'not found user' at same time
::
 :zaplines
(SET founduser=)
FOR /f "delims==" %%L IN ('set line 2^>nul') DO (SET %%L=)
GOTO :eof

:report
IF NOT DEFINED line1000 GOTO :EOF 
SET /a count+=1
ECHO.
ECHO.---------- START of record %count% ----------
FOR /l %%L IN (1000,1,1200) DO IF DEFINED line%%L CALL ECHO.%%line%%L%%
ECHO.----------- END of record %count% -----------
GOTO :eof

Thanks Peter, and a good effort it is. Unfortunately, it's too slow for me to use. I initiated the script about 20 minutes ago. Since then, both cores of my CPU have been bouncing between 70 - 100%, but the script is still trying to loop through the first log file. I don't think I'll be able to use a pure batch solution. — rojo, Mar 26 '13 at 12:55

Aacini · Answer 3 · 2013-03-26T18:34:10.407

Below there is a pure Batch solution that does not use grep. It locates timestamp lines because the "summary" word that must not exist in other lines, but this word may be changed for another one if needed.

EDIT: I changed the word that identify timestamp lines to "Auth."; I also changed FINDSTR seek to ignore case. This is the new version:

@echo off
setlocal EnableDelayedExpansion

:parselog <username> <logfile>
echo Searching %~2 for records containing %~1...

set n=0
set previousMatch=Auth.
for /F "tokens=1* delims=:" %%a in ('findstr /I /N "Auth\. %~1" %2') do (
   set currentMatch=%%b
   if "!previousMatch:Auth.=!" neq "!previousMatch!" (
      if "!currentMatch:Auth.=!" equ "!currentMatch!" (
         set /A n+=1
         set /A skip[!n!]=!previousLine!-1
      )
   ) else (
      set /A end[!n!]=%%a-1
   )
   set previousLine=%%a
   set previousMatch=%%b
)
if %n% equ 0 (
   echo No records found
   goto :EOF
)

if not defined end[%n%] set end[%n%]=-1
set i=1
:nextRecord
   echo/
   echo ---------------start of record %i%-------------
   if !skip[%i%]! equ 0 (
      set skip=
   ) else (
      set skip=skip=!skip[%i%]!
   )
   set end=!end[%i%]!
   for /F "%skip% tokens=1* delims=:" %%a in ('findstr /N "^" %2') do (
      echo(%%b
      if %%a equ %end% goto endOfRecord
   )
   :endOfRecord
   echo ---------------end of record %i%-------------
   set /A i+=1
if %i% leq %n% goto nextRecord

Example command:

C:>test user6q catch-all.log

Result:

Searching catch-all.log for records containing user6q...

---------------start of record 1-------------
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730159    Mon Mar 25 08:02:29 2013    680 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 9   Logon attempt by:   MICROSOFT_AUTHENTICATION_PACKAGE_V1_0

Logon account:  USER6Q

Source Workstation: dc3

Error Code: 0xC0000234
---------------end of record 1-------------

---------------start of record 2-------------
2013-03-25 08:02:32 Auth.Critical   169.254.8.110   Mar 25 08:02:32 dc3 MSWinEventLog   2   Security    11730160    Mon Mar 25 08:02:29 2013    539 Security    NT AUTHORITY\SYSTEM N/A Audit Failure   dc3 2   Logon Failure:

    Reason:     Account locked out

    User Name:  USER6Q@MYDOMAIN.TLD

    Domain: MYDOMAIN

    Logon Type: 3

    Logon Process:  Advapi  

    Authentication Package: Negotiate

    Workstation Name:   dc3

    Caller User Name:   dc3$

    Caller Domain:  MYDOMAIN

    Caller Logon ID:    (0x0,0x3E7)

    Caller Process ID: 400

    Transited Services: -

    Source Network Address: 169.254.7.89

    Source Port:    55314
---------------end of record 2-------------

This method use just one execution of findstr command to locate all matching records, and then one additional findstr command to show each record. Note that first for /F ... command works over findstr "Auth. user.." results, and the second for /F command have a "skip=N" option and a GOTO that break the loop as soon as the record was displayed. This mean that FOR commands does not slow down the program; the speed of this program depends on the speed of FINDSTR command.

However, it is possible that the second for /F "%skip% ... in ('findstr /N "^" %2') command take too long because the size of FINDSTR output result before it is processed by the FOR. If this happen, we could modify the second FOR by another faster method (an asynchronous pipe that will be break, for example). Please, report the result.

Antonio

+1, very nice work, very fast with empty lines from the records but no exclamation marks in the output ( DelayedExpansion). I like the 'skip' trick. — Endoro, Mar 26 '13 at 09:08
I very much appreciate the work you put into this. The summary doesn't actually contain the word "summary" (which isn't that big a deal, as I can `findstr /n "Auth."`); but the timestamp / summary line will not include the account name, which is kind of a bigger deal. — rojo, Mar 26 '13 at 12:28
I've been leaning toward using a binary such as `awk` or `grep` or similar to parse the files, as in my experience, batch `for` loops are drastically slower. At first, I also tried a JScript `textfile.ReadAll()`; but that took forever and a day on my 100+ meg, 2.5 million lines per hour log files as well. But I'm curious to see whether your method, if fixed, is more efficient than what I've attempted in the past. — rojo, Mar 26 '13 at 12:29
@rojo: See the new version of my program. Please, be aware that there are still two or three modifications that could speed up my method with very large files! — Aacini, Mar 26 '13 at 18:41
@Aacini - test file = 2,627,399 lines @ 94,331KB. Ed Morton's [awk method](http://stackoverflow.com/a/15628309/1683264) = 10 seconds to find 25 records. Your scripted batch loop = 15 minutes and counting, but I'm going to have to `^C` it. Thank you for satisfying my curiosity though! You're a scholar and a gentleman. — rojo, Mar 26 '13 at 19:28
@rojo: Did my program displayed "start of record 1" or not? If yes, then I still have hope for speed it up! — Aacini, Mar 26 '13 at 19:35
@Aacini - I'm afraid not. It never progressed beyond the first `for /f` loop as far as I could see. — rojo, Mar 26 '13 at 19:37
@rojo: May I ask you a simple request? Could you execute `findstr /I /N "Auth\. userXY" theBIGfile.log` from the command prompt and write down the timing? If it is more than 15 minutes, then the speed problem is in FINDSTR itself and we can not do anything about it! — Aacini, Mar 26 '13 at 20:04
When allowed to output to the console, the `findstr` command takes 4 minutes 43 seconds. With the output redirected to `>NUL` the command takes 10.7 seconds. Regardless of output, your `findstr` command will cause your `for /f` loop to loop 2.5+ million times, depending on the size of the log file. Your `findstr` command is shorthand for `findstr /i /n /r /c:"Auth\." /c:"userXY"` which will return *every* record's first line, regardless of whether the record contains `userXY` or not. At 0.25ms per loop iteration, that's still a little over 5 minutes from start to finish if my math is correct. — rojo, Mar 27 '13 at 14:19
... and your `for /f` loop is decidedly not 0.25ms per loop. Regardless of how much additional fat you can trim from your already anorexic loop, your pure batch solution cannot compete with the speed of a binary designed for this sort of text parsing. It was a valiant effort, though. — rojo, Mar 27 '13 at 14:26

score 0 · Answer 4 · answered Mar 26 '13 at 02:10

0

I think awk is all you need:

awk "/---start of record---/,/---end of record---/ {print}" logfile

That's all you need if the first line indicator is:

---start of record---

and the last is:

---end of record---

Notice that there is no middle-pattern matching, that "," is just a separator for both regexps.

answered Mar 26 '13 at 02:10

Daniel

21,933
14
72
101

A wise man once said that love is all you need. In this case, though, I think you mostly copied the `awk` line I had in my question, disregarding the rest. – rojo Mar 26 '13 at 13:03

regexp match within a log file, return dynamic content above and below match

4 Answers4

Linked