Can I sort with context in bash?

Question

When I want to merge log files, I often use cat logA.log logB.log | sort. As long as the log lines start with some timestamp-like string in a common format, that's fine.

But can I somehow sort the lines and keep lines that do(n't) follow a certain rule glued to their original leading line? Just think of a log file where somebody logged something with linebreaks in it (without me knowing that)!

(berta.log)
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?

(caesar.log)
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

These two log files of course would become unusable if merged with cat berta.log caesar.log | sort.

(I also am really unsure if I should post this question to StackOverflow or to Superuser or even to Unix or ServerFault...)

Edit for clarity

The merged logs should look e.g. like this:

2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

How do you want to sort the presented input? What should be the output? What is the sorting key? `glued to their original leading line?` What is a "leading line" and which one is it? I do not understand - just `sort logA.log ; sort logB.log` if you want to sort one file at a time? — KamilCuk, Oct 01 '21 at 10:39
@KamilCuk, I've edited the question. The sorting key is the timestamp at the beginning of the lines. The "leading line" in this example is the `00:00:20` line that came from a string that contains a stacktrace and (thus) linebreaks. — Bowi, Oct 01 '21 at 11:03

score 4 · Accepted Answer · answered Oct 01 '21 at 11:12

Classic problem of mixing lines and files.

A solution: Put your multiline log lines on one line

Executable script: ./onelinelog.awk

#! /usr/bin/awk -f

# Timestamp line
/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
    if (log_line != "") { print log_line }
    log_line = $0
    next
}
# Other line
{
    # Use '§' to separate the original lines of one log entry
    log_line = log_line "§" $0
}
# End of file
END {
    if (log_line != "") { print log_line }
}

Test on caesar.log file:

$ ./onelinelog.awk caesar.log 
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.§    at Conversation.parseStatement§    at Conversation.considerReplyToStatement§    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

Sort:

cat <(./onelinelog.awk caesar.log) <(./onelinelog.awk berta.log) | sort

or

sort <(./onelinelog.awk caesar.log) <(./onelinelog.awk berta.log)

Output:

2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.§    at Conversation.parseStatement§    at Conversation.considerReplyToStatement§    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!
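
One caveat with a plain sort: it compares whole lines, so entries that share a timestamp end up ordered by their message text rather than by their original order. A possible refinement, as a sketch only: restrict the sort key to the two timestamp fields and request a stable sort. `sort_by_timestamp` is a hypothetical helper name, and `-s` is an extension offered by GNU and BSD sort, not something POSIX guarantees.

```shell
# Sort already-joined log entries by timestamp only (fields 1 and 2).
# -s keeps the input order of entries whose timestamps are equal.
# LC_ALL=C forces plain byte-wise comparison, independent of the locale.
sort_by_timestamp() {
    LC_ALL=C sort -s -k1,2
}

# Possible usage with the script above:
#   { ./onelinelog.awk caesar.log; ./onelinelog.awk berta.log; } | sort_by_timestamp
```

With this, two entries logged in the same second stay in the order in which they were fed to sort.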

Fun?

You may want to recover your original lines...

Use sed:

$ cat and/or sort ... | sed -e 's/§/\n/g'

or another executable awk script: ./tomultilinelog.awk

#! /usr/bin/awk -f
BEGIN {
    FS="§"
}
{
    for (i = 1; i <= NF; i += 1) { print $i }
}

So execute:

$ cat <(./onelinelog.awk caesar.log) <(./onelinelog.awk berta.log) | sort | ./tomultilinelog.awk 
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

Of course, you could adapt the code and replace the '§' character with another token.
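
As a rough sketch of that adaptation, the whole join / sort / split pipeline can be wrapped in one POSIX shell snippet with a configurable token. The names `SEP`, `join_entries`, `split_entries` and `merge_logs` are made up for this illustration, and `@@NL@@` is just an assumed token that never occurs in the logs. Because awk treats a multi-character separator as a regular expression, the token should not contain regex metacharacters.

```shell
#!/bin/sh
# Hypothetical boundary token; choose any string that cannot occur in the logs.
SEP='@@NL@@'

# Join each entry (timestamp line plus its continuation lines) onto one line.
join_entries() {
    awk -v sep="$SEP" '
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
            if (entry != "") print entry
            entry = $0
            next
        }
        { entry = entry sep $0 }
        END { if (entry != "") print entry }
    ' "$@"
}

# Split joined entries back into their original lines.
split_entries() {
    awk -v sep="$SEP" '{
        n = split($0, part, sep)
        for (i = 1; i <= n; i++) print part[i]
    }'
}

# Join every file separately, sort all entries together, then restore the lines.
merge_logs() {
    for f in "$@"; do
        join_entries "$f"
    done | LC_ALL=C sort | split_entries
}
```

Calling `merge_logs berta.log caesar.log` should then print the merged log with the stack trace still attached to its `00:00:20` line.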

This is a great solution which I like much more than my own. :-) — Bowi, Oct 01 '21 at 11:35
Just as an extension about the `§`: As far as I understand it, it doesn't have to be a single character, so one can choose any string that is unlikely to occur in the logs (I always recommend e.g. *PUMUCKL* (the kobold)). — Bowi, Oct 01 '21 at 11:56
I agree with you, Bowi. I use a single character in my example but, as in the source of a mail message, it's better to use a boundary token — Arnaud Valmary, Oct 01 '21 at 13:32

score 0 · Answer 2 · answered Oct 01 '21 at 11:34

I came up with another awk solution while Arnaud Valmary was posting his.

In my attempt, I just prefixed all lines that do not start with a timestamp with the last timestamp (and a number):

prefixAllLines.awk

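#! /usr/bin/awk -f
# Note: gensub() is a GNU awk (gawk) extension.

BEGIN {
    linePattern = "^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}) (.*)"
}
{
    if ($0 ~ linePattern) {
        # Timestamp line: remember its timestamp and restart the counter
        number = 0
        linePrefix = gensub(linePattern, "\\1", "g", $0)
        lineRest = gensub(linePattern, "\\2", "g", $0)
    } else {
        # Continuation line: reuse the last timestamp and count up
        number += 1
        lineRest = $0
    }
    # Pass the text as printf arguments, never as the format string,
    # so a '%' inside a log line cannot break the output
    printf("%s %03d %s\n", linePrefix, number, lineRest)
}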

So, ./prefixAllLines.awk caesar.log prints:

2021-10-01 00:00:00 000 Hey Berta
2021-10-01 00:00:20 000 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
2021-10-01 00:00:20 001         at Conversation.parseStatement
2021-10-01 00:00:20 002         at Conversation.considerReplyToStatement
2021-10-01 00:00:20 003         at Conversation.doConversation
2021-10-01 00:00:40 000 I am not Adam, I am Caesar!

And cat <(./prefixAllLines.awk caesar.log) <(./prefixAllLines.awk berta.log) | sort:

2021-10-01 00:00:00 000 Hey Berta
2021-10-01 00:00:10 000 Hey!
2021-10-01 00:00:11 000 How are you doing, Adam?
2021-10-01 00:00:20 000 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
2021-10-01 00:00:20 001         at Conversation.parseStatement
2021-10-01 00:00:20 002         at Conversation.considerReplyToStatement
2021-10-01 00:00:20 003         at Conversation.doConversation
2021-10-01 00:00:40 000 I am not Adam, I am Caesar!

But I like Arnaud Valmary's approach much more. :-)
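
For completeness, the prefixes can be stripped again after sorting. The following is only a sketch under the assumption that every timestamp is exactly 19 characters wide, as in the sample logs; `strip_prefixes` is a name invented here.

```shell
# Undo the "timestamp NNN text" prefixes after sorting.
# Assumes the timestamp occupies exactly the first 19 characters of each line.
strip_prefixes() {
    awk '{
        ts   = substr($0, 1, 19)         # "YYYY-MM-DD HH:MM:SS"
        rest = substr($0, 21)            # "NNN text"
        sp   = index(rest, " ")          # first blank ends the counter
        num  = substr(rest, 1, sp - 1)
        text = substr(rest, sp + 1)
        if (num + 0 == 0) print ts " " text    # original timestamp line
        else              print text           # continuation line
    }'
}
```

Appending `| strip_prefixes` to the sort pipeline gives back the original line layout: lines numbered 000 keep their timestamp, all others lose the repeated prefix.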

Interesting solution. But if you have two or more identical timestamps in the same file or across the two files, this solution mixes all content that shares the same prefix — Arnaud Valmary, Oct 01 '21 at 14:10
That's right, and that's one reason why I like your approach more =) — Bowi, Oct 01 '21 at 15:42

score 0 · Answer 3 · answered Oct 12 '22 at 05:45

Super Speedy Syslog Searcher can read multiple log files then print multi-line log messages from those files. It will print the log messages sorted by datetime.

(assuming you have Rust installed)

cargo install super_speedy_syslog_searcher

then

s4 logA.log logB.log
