This is my first post on Stack Overflow. \0/ I hope it's not too long. I'm writing a Bash script that regularly reads, filters, and outputs data from thousands of log files. Performance is important, which is why I'm mainly using grep instead of awk or sed.

grep -Poz does exactly what I want: it uses patterns to capture the (multiline) data that's relevant for further processing. But I'm stuck on turning that data into, for example, an XML file or a SQLite3 batch query for further analysis.

#!/bin/bash
# Regex:
# (?s) dotall mode: "." also matches newlines, so one match can span several lines
# Capture group 1 = date
# Capture group 2 = time
# Capture group 3 = error type (ERROR, WARN or DEBUG)
# Capture group 4 = error details
# Positive lookahead: match until a new line (Windows/Linux) starts with a date, OR (if it's the last entry matching the pattern) until the end of the input.
#
REGEX_MULTILINE="(?s)([0-9]{4}-[0-9]{2}-[0-9]{2})[[:space:]]([0-9]{2}:[0-9]{2}:[0-9]{2}[,|.][0-9]{3})[[:space:]]+(ERROR|WARN|DEBUG)(.*?)(?=(?:\r\n|[\r\n])[0-9]{4}-[0-9]{2}-[0-9]{2}|\z)"
LOGFILE="test.log"

# write to logfile gives exactly the info I want
write_log(){
    echo -n $(grep -Pzo $REGEX_MULTILINE $LOGFILE) > output_grep1.txt
}

# I'm stuck in this part to generate, for example, an XML-file
write_xml(){
    local LOGDATE=""
    local LOGTIME=""
    local LOGTYPE=""
    local LOGINFO=""
    while IFS= read -r LINE ; do
    #For testing purposes, to see if brackets contain the full string, 
    #or a line of that string
    printf '%s\n' "[$LINE]"
    #processing logic here. Didn't get this far yet
    while [[ $LINE =~ $REGEX_MULTILINE ]] ; do
        # regex capture groups
        LOGDATE=${BASH_REMATCH[1]}
        LOGTIME=${BASH_REMATCH[2]}
        LOGTYPE=${BASH_REMATCH[3]}
        LOGINFO=${BASH_REMATCH[4]}
        # send vars to function for output
        # write_xml_function $LOGDATE $LOGTIME $LOGTYPE $LOGINFO
        # for testing purposes
        echo -e "log entry:\n\t 1: $LOGDATE \n\t 2: $LOGTIME \n\t 3: $LOGTYPE \n\t 4: $LOGINFO \n" 
        break
    done
done < <(grep -Pzo $REGEX_MULTILINE $LOGFILE)
}

A logfile may look something like this:

2017-01-01 11:09:42,439 INFO  server.service.function.property.PropertyService - Props (re)loaded.
2017-01-01 11:15:46,155 DEBUG server.service.ApiController - api/start called! params:
${params}
2017-01-01 13:01:29,675 ERROR server.service.util.base.FtpClient - Error retrieving file. Directory does not exist.
2017-01-01 13:15:12,803 DEBUG server.service.ApiController - api/start called! params:
${params}
2017-01-01 13:15:13,932 INFO server.service.ControllerService - Filter:server.service.model.Filters
2017-01-01 15:36:04,914 INFO server.service.ControllerService - Filter:server.service.model.Filters
2017-01-01 15:55:50,279 ERROR server.service.WebClient - server API failed: [(someError.java:12345)]
{"someId":"etc","otherId":123,"token":{}}
2017-01-01 15:55:50,366 ERROR server.service.controller.Search - Server error for [/service/search/load]: java.lang.NullPointerException stack[etc]
java.lang.NullPointerException
    at server.common.stack(SomeApi.java:123)
    at server.service.trace(SomeService.java:456)
    at java.lang.Thread.run(Thread.java:789)
    etc.
    etc.
2017-01-01 16:17:55,175 DEBUG server.config.app - 

STARTING...


2017-01-01 16:18:00,040 INFO  server.common.service.base.property - Props (re)loaded.
2017-01-01 17:44:43,959 DEBUG server.service.controller - api/start called! params:
${params}

The result I expect when reading the multiline grep output is this:

[2017-01-01 13:15:13,932 INFO server.service.ControllerService - Filter:server.service.model.Filters]
[2017-01-01 15:36:04,914 INFO server.service.ControllerService - Filter:server.service.model.Filters]
[2017-01-01 15:55:50,279 ERROR server.service.WebClient - server API failed: [(someError.java:12345)]
{"someId":"etc","otherId":123,"token":{}}]
[2017-01-01 15:55:50,366 ERROR server.service.controller.Search - Server error for [/service/search/load]: java.lang.NullPointerException stack[etc]
java.lang.NullPointerException
    at server.common.stack(SomeApi.java:123)
    at server.service.trace(SomeService.java:456)
    at java.lang.Thread.run(Thread.java:789)
    etc.
    etc.]

Instead I get this:

[2017-01-01 13:15:13,932 INFO server.service.ControllerService - Filter:server.service.model.Filters]
[2017-01-01 15:36:04,914 INFO server.service.ControllerService - Filter:server.service.model.Filters]
[2017-01-01 15:55:50,279 ERROR server.service.WebClient - server API failed: [(someError.java:12345)]
{"someId":"etc","otherId":123,"token":{}}]
[2017-01-01 15:55:50,366 ERROR server.service.controller.Search - Server error for [/service/search/load]: java.lang.NullPointerException stack[etc]]
[java.lang.NullPointerException]
[   at server.common.stack(SomeApi.java:123)]
[   at server.service.trace(SomeService.java:456)]
[   at java.lang.Thread.run(Thread.java:789)]
[   etc.]
[   etc.]

What did I overlook? Can it be done this way?

Charles Duffy
Asgair
  • `grep` works line-by-line, not on whole files. You'll have to use something else, like Perl or Python. – squirl Jan 24 '17 at 14:39
  • @Samadi, with `-z`, it is not line-by-line. – Charles Duffy Jan 24 '17 at 15:44
  • @Asgair, as an aside -- all-caps variable names are specified by POSIX to be used for variables with meaning to the operating system or shell, whereas names with at least one lower-case character are reserved for application use; you should be using names in the latter class. If a future version of your shell adds a new all-caps builtin, this will ensure that nothing you're using overwrites it by accident. – Charles Duffy Jan 24 '17 at 15:45
  • Also, avoid `echo -e` -- it conflates your literal data with your format string, and isn't defined *at all* by POSIX; a compliant `echo` will actually write `-e` on output, and may (or may not, depending on whether it supports XPG extensions to the standard) interpret backslash escapes out-of-the-box. Even bash is inconsistent in this behavior depending on which flags it was compiled with (`--enable-default-xpg-echo`). – Charles Duffy Jan 24 '17 at 15:48
  • `printf '\t%s\n' "1: $logDate" "2: $logTime" "3: $logType" "4: $logInfo"` (assuming the variables are appropriately renamed per prior comment) will print your four lines with tab indents and trailing newlines, but *without* attempting to honor any escape sequences which might exist within the literal data. If you wanted to format nonprintable characters legibly, you could consider a format operator other than `%s`, such as `%q`. – Charles Duffy Jan 24 '17 at 15:50
  • ...`echo -n` is similarly poorly-specified (POSIX simply specifies that behavior is *unspecified* when the `-n` argument is passed), and thus likewise better replaced with `printf`. You've also got a bunch of quoting bugs -- http://shellcheck.net/ will find them. – Charles Duffy Jan 24 '17 at 15:52
  • @CharlesDuffy, oops! I should learn to check manpages more carefully :) – squirl Jan 24 '17 at 15:59
  • @Asgair, ...btw, do you want your output to be NUL-delimited? (Right now, you're dropping the NULs -- depending on your shell, `$(...)` will either terminate what it captures at the first NUL, or silently delete NULs; bash takes the latter approach). – Charles Duffy Jan 24 '17 at 16:03
  • @CharlesDuffy: Since Bash 4.4, Bash isn't silent anymore! `bash: warning: command substitution: ignored null byte in input`. – gniourf_gniourf Jan 24 '17 at 16:06
  • @Asgair : Not bad for a first Q on Stackoverflow. The main place to improve is to keep the Q as small and as focused as possible. (hard to do sometimes, so don't sweat it). While the MCVE page is very generic, for shell scripting you can usually have a good Q with 1. small sample set of data (that includes data that shouldn't be processed). 2. required output given that input, 3. your current code, output, error msgs (exact!) 4. your brief thoughts about what else you have tried, and why you like your current approach ;-) Keep posting and Good luck to all. – shellter Jan 24 '17 at 16:31
  • @CharlesDuffy Thanks for the advice on using caps, the use of printf format operators, and the other tips! Very helpful. I will change it immediately. Regarding NULs, I'm just trying my options to get the right result. ;-) – Asgair Jan 24 '17 at 17:18
  • @shellter Thanks for the tips. :-) – Asgair Jan 24 '17 at 17:18
  • ...addendum: I should have said "XSI" extensions, earlier -- oops! – Charles Duffy Jan 24 '17 at 17:22
  • @Asgair, ...I'm trying to determine what you *consider* a "right result". If you consider the NULs purely internal content that shouldn't be present in either literal input or literal output, that's something that should be explicitly specified -- by contrast, if you have your input format logging a literal NUL after each multi-line log entry, (1) yay for the foresight! that's making your job now much easier; and (2) it should be explicitly specified. Since NULs don't show up when written to a terminal, showing what your log sample looks like when `cat`ted doesn't show this either way. – Charles Duffy Jan 24 '17 at 17:27
  • @charlesduffy There are no NULs in the log files, nor do I need them in the output. As I'm new to writing bash scripts (learning as I go), I was considering replacing \n with NULs at the end of each multiline string returned by grep, just for the sake of manipulating "while read". That didn't work out as I expected. – Asgair Jan 24 '17 at 23:20
  • @charlesduffy In the end, I just need to write either an XML file to generate a report, or create a SQLite3 bulk query string. To keep this script as fast as possible, I thought I'd stick with grep and its regex capture groups instead of using tools like awk. – Asgair Jan 24 '17 at 23:34
  • @CharlesDuffy, is it common practice to correct the original post? (CAPs, echo) – Asgair Jan 25 '17 at 09:01
  • Only if the correction doesn't impact any preexisting answers. If in doubt, just take notes under advisement for future scripts – Charles Duffy Jan 25 '17 at 14:13
  • BTW, I'm curious as to whether you've benchmarked awk. I'd guess that a unified native awk script would outperform a grep feeding into a bash instance. – Charles Duffy Jan 25 '17 at 14:15
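
For comparison, here is a rough sketch of what a single awk pass over the log could look like. It is only an illustration, not a benchmarked replacement: `group_entries` and `emit` are hypothetical names, and it assumes Unix line endings and that every entry starts with a `YYYY-MM-DD` date followed by the time and the log level.

```shell
#!/bin/bash
# group_entries (hypothetical helper): read a log on stdin, attach continuation
# lines to the timestamped line before them, and print only ERROR/WARN/DEBUG
# entries, each wrapped in brackets.
group_entries() {
    awk '
        # Print the buffered entry if its level was one we want to keep.
        function emit() { if (keep) printf "[%s]\n", entry; keep = 0 }

        # A line starting with a date opens a new entry.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            emit()
            entry = $0
            keep = ($3 ~ /^(ERROR|WARN|DEBUG)$/)   # $3 is the log level
            next
        }

        # Any other line continues the current entry, if we are keeping it.
        keep { entry = entry "\n" $0 }

        # Flush the last buffered entry at end of input.
        END { emit() }
    '
}
```

It would be called as `group_entries < "$logfile"`; whether it is actually faster than grep feeding a bash loop would still need to be measured on real logs.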

1 Answer

The problem is with your read command. By default, read will read until a newline, but you are trying to process null-separated strings.

You should be able to use

while IFS= read -r -d '' LINE ; do
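
As a minimal sketch of that loop in isolation, the snippet below feeds two hand-made NUL-terminated records into `read -d ''`. The sample strings and the `records` array are made up for illustration; they stand in for what `grep -Pzo` is expected to emit.

```shell
#!/bin/bash
# Collect NUL-delimited records; embedded newlines stay inside each record.
records=()
while IFS= read -r -d '' record; do
    records+=("$record")
done < <(printf '%s\0' $'first entry\n    continued' $'second entry')

# Print each record in brackets to show where it starts and ends.
printf '[%s]\n' "${records[@]}"
```

With `-d ''`, `read` stops at the NUL byte rather than at the newline, so the first record keeps its continuation line.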
Grisha Levit
  • *grumble* re: showcasing an all-caps variable name, contrary to [POSIX guidelines](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html) (note that since setting a regular shell variable will overwrite a like-named environment variable, naming conventions for one class necessarily apply to both). – Charles Duffy Jan 24 '17 at 15:54
  • I agree, but I think it's best to answer questions using variable names from the code provided. – Grisha Levit Jan 24 '17 at 15:57
  • There's an argument to be made both ways, definitely -- but inasmuch as StackOverflow answers are providing canonical code to be seen and copied by others (not just the OP asking a question!), there's value in showcasing good practices therein. – Charles Duffy Jan 24 '17 at 15:59
  • @GrishaLevit Thanks for the info, but for some reason it doesn't work. Is grep supposed to end every (multiline) string with a NUL? Can I check that somehow? `test1(){ #this works, without defining a delimiter echo start i="0" while IFS= read -r line ; do i=$((i+1)) echo "$i: [$line]" done < <(grep -Pzo $regex_multiline $logfile) echo end } test2(){ #this does not work, with a NUL delimiter echo start i="0" while IFS= read -r -d '' line ; do i=$((i+1)) echo "$i: [$line]" done < <(grep -Pzo $regex_multiline $logfile) echo end }` – Asgair Jan 25 '17 at 09:06
  • Sorry, that comment did not work as I expected. Is there another way I can add some example code to try and pin down where it fails? – Asgair Jan 25 '17 at 09:10
  • grep _does_ support the -Z or --null flag to output a zero byte (the ASCII NUL character), but `grep -PZzo $regex_multiline $logfile | while IFS= read -r -d '' line ; do` doesn't do the trick either. – Asgair Jan 25 '17 at 09:40
  • OK, so the problem was the version of grep I used (2.16), which came as the default with Windows 10's Linux Subsystem that I'm currently working with. Relevant posts: [grep-equivalent-for-finds-print0](https://stackoverflow.com/questions/15976570/is-there-a-grep-equivalent-for-finds-print0-and-xargss-0-switches) and [line-regexp-option-with-null-data](https://stackoverflow.com/questions/31467045/line-regexp-option-with-null-data) – Asgair Jan 29 '17 at 15:35
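
One way to check whether a given grep build actually terminates each `-z` match with a NUL is to count the NUL bytes in its output. A minimal sketch, where `count_nuls` is a hypothetical helper name:

```shell
#!/bin/bash
# count_nuls (hypothetical helper): count the NUL bytes arriving on stdin
# by deleting every other byte and counting what is left.
count_nuls() {
    tr -cd '\000' | wc -c
}

# Sanity check on hand-made input: two records, each followed by a NUL.
printf 'one\0two\0' | count_nuls
```

Piping `grep -Pzo "$regex_multiline" "$logfile"` into `count_nuls` should then report roughly one NUL per matched entry. A count of zero would suggest the matches are not NUL-terminated, in which case `read -d ''` has nothing to split on.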