Bash script to extract information from a block of text spanning multiple lines

Question

I am trying to extract track information from MKV files using mkvinfo from a bash script. The output is a long series of lines with repeating patterns as delimiters for various track properties of various track types. An example of a track is:

…
| + A track
|  + Track number: 6 (track ID for mkvmerge & mkvextract: 5)
|  + Track UID: 11555278830806058806
|  + Track type: subtitles
|  + (Unknown element: TrickTrackFlag; ID: 0xc6 size: 3)
|  + Enabled: 1
|  + Default flag: 0
|  + Forced flag: 0
|  + Lacing flag: 0
|  + MinCache: 0
|  + Timecode scale: 1
|  + Name: Spanish
|  + Language: spa
|  + Codec ID: S_TEXT/UTF8
|  + (Unknown element: TrackAttachmentLink; ID: 0x7446 size: 11)
|  + Codec decode all: 1
| + A track
|  + Track number: 7 (track ID for mkvmerge & mkvextract: 6)
…

There can be multiple instances of a given track type and the number of lines for a track is somewhat variable. I need to extract certain track properties from specific track types. For example, if I want to find all instances of the subtitles track type and extract the Track number and the Codec ID, I can pipe the results through grep:

mkvinfo "file.mkv" | grep "subtitles" -B 2 | grep "Track number"

This outputs the lines containing the track numbers for all subtitle tracks. I have to put the lines into an array and filter them to get the first number so I can use it with mkvpropedit, which requires the first number.

Similarly:

mkvinfo "file.mkv" | grep "subtitles" -A 10 | grep "Codec ID: " | sed 's/^.*: //'

outputs the codec IDs for all subtitle tracks.

This works fine IF I know exactly how many lines there are before/after the line containing subtitles. The problem is, the exact number of lines to include varies from file to file. So what I need to do is to output the entire block of lines between | + A track and a line beginning with |+ OR | + OR EOF. I also need to filter the block to extract the first Track number and the Codec ID. I tried using | grep -Eo [0-9]+ | head -1 to extract the first number of each track but it only works on the first track found and quits. If there's a way to make it work for all tracks in one line that would be helpful. The second example I gave using sed works for the Codec ID.

The bottom line QUESTION is:

How can I extract specific properties of specific track types, such as the example given, and put them into an array or arrays for further processing?

I am hoping to be able to meet the following criteria:

I want to use existing bash (GNU bash, version 4.3.30(1)-release (x86_64-apple-darwin12.5.0)) utilities like sed, awk, grep, …
I don't want to have to create an 'intermediate file'
I want to simply pipe the output of mkvinfo into the various utilities

I found lots of threads that show how to use sed to find a block of text between two words but I could not get the code to work with entire lines or strings containing spaces. Maybe there is a way to do that but I don't know enough about sed to be able to adapt the code to my situation.

Please explain in detail how your code works so I can 'learn how to fish' so next time I can do it myself.

fferri · Accepted Answer · 2015-04-27T18:31:03.750

When processing multiple lines in complex ways, my tool of choice is awk.

Each rule below is tested against every input line. When a rule's pattern matches, we save the captured value in a variable. When we reach the line that starts a new block (| + A track), or the end of the stream, we print the variables we are interested in (track number, codec ID), but only if the track that just ended was of type subtitles.

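mkvinfo ... | gawk '
    # Remember the most recent value of each property as its line goes by.
    match($0, /Track number: ([0-9]+)/, m) {TN=m[1]}
    match($0, /Codec ID: (.*)$/, m)        {CI=m[1]}
    /Track type: subtitles/                {SUB=1}
    # A new track begins: report the previous one if it was a subtitles track,
    # then clear the flag (awk has no unset statement, so assign 0).
    /^\| \+ A track$/ {if(SUB) print TN, CI; SUB=0}
    # End of input: report the last track.
    END               {if(SUB) print TN, CI}'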

You need gawk because the three-argument form of match, which stores the parenthesized groups in an array, is a gawk extension.
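To get the results into bash arrays without an intermediate file, the awk output can be read back through process substitution. The following is a minimal sketch under stated assumptions, not the code above: it sticks to POSIX awk (`sub` instead of gawk's three-argument `match`), it treats any line starting with `|+` or `| +` as a block boundary as described in the question, and the function `sample_mkvinfo` is a hypothetical stand-in for `mkvinfo "file.mkv"` that prints made-up lines in the shape shown in the question. The names `extract_subs`, `numbers` and `codecs` are also illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for: mkvinfo "file.mkv" (made-up sample data)
sample_mkvinfo() {
cat <<'EOF'
| + A track
|  + Track number: 1 (track ID for mkvmerge & mkvextract: 0)
|  + Track type: video
|  + Codec ID: V_MPEG4/ISO/AVC
| + A track
|  + Track number: 6 (track ID for mkvmerge & mkvextract: 5)
|  + Track type: subtitles
|  + Name: Spanish
|  + Codec ID: S_TEXT/UTF8
| + A track
|  + Track number: 7 (track ID for mkvmerge & mkvextract: 6)
|  + Track type: subtitles
|  + Codec ID: S_HDMV/PGS
|+ Some other top-level element
EOF
}

# Print "number<TAB>codec" for every subtitles track.
extract_subs() {
  awk '
    # Report the finished block if it was a subtitles track, then reset state.
    function emit() { if (is_sub) print tn "\t" ci; tn = ci = ""; is_sub = 0 }
    /^\| ?\+/         { emit() }   # "|+ ..." or "| + ...": a block ends here
    /Track number: /  { tn = $0; sub(/.*Track number: /, "", tn); sub(/[^0-9].*/, "", tn) }
    /Codec ID: /      { ci = $0; sub(/.*Codec ID: /, "", ci) }
    /Track type: subtitles/ { is_sub = 1 }
    END               { emit() }   # the last block may end at EOF
  '
}

# Fill two parallel arrays straight from the pipeline.
numbers=()
codecs=()
while IFS=$'\t' read -r tn ci; do
  numbers+=("$tn")
  codecs+=("$ci")
done < <(sample_mkvinfo | extract_subs)

for i in "${!numbers[@]}"; do
  printf '%s %s\n' "${numbers[$i]}" "${codecs[$i]}"
done
```

With real input, replace `sample_mkvinfo` with the actual `mkvinfo` call. The loop reads from `< <(...)` rather than from a pipe so that the arrays are filled in the current shell and remain available afterwards.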

edited Apr 27 '15 at 18:31

answered Apr 27 '15 at 09:24

fferri

18,285
5
46
95

That works superbly. Thanks a lot! Now I have to study the code to understand how it works, for future reference. I definitely will use `awk`|`gawk` in the future. – hmj6jmh Apr 27 '15 at 15:46
This is just a bit less common than the classical awk one-liner you can find around, where you have `awk '/pattern/ {print $something}'`, but it functions in exactly the same way. We need match due to the irregular nature of the input (different separators across lines) – fferri Apr 27 '15 at 15:54
I notice that the first two `match` statements return the entire match in `m[0]` and just the regex after the space in `m[1]`. How would you put both track numbers into the array? Would that require a separate `match`? Also what does the ending `/^\| \+ A track$/` statement do? Does it ensure that the previous matches came after an `| + A track` line? How does that work? – hmj6jmh Apr 27 '15 at 18:21
really, `match($0, /pattern/) {...}` is equivalent to `/pattern/ {...}` – fferri Apr 27 '15 at 18:27
the pattern `/^\| \+ A track$/` matches the beginning of a new block (track). if the track you have just finished reading was a subtitles track (you know that because SUB is 1), you print the interesting variables. then you reset SUB. – fferri Apr 27 '15 at 18:29
I am confused about the order of the matches. The output of `mkvinfo` is `A track`, `Track number`, `Track type`, `Codec ID` but the order of the matches is different with the `A track` one being last. How do all these matches get tied to the same record? – hmj6jmh Apr 27 '15 at 20:10
each line is checked among all patterns. in this particular case, since your information spans multiple lines, the order in which patterns are specified is not relevant. but there are some cases in which the order of patterns can make a difference. – fferri Apr 27 '15 at 20:14
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/76390/discussion-between-hmj6jmh-and-mescalinum). – hmj6jmh Apr 27 '15 at 20:47
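
On the question in the comments about capturing both numbers from the Track number line: with gawk, a second parenthesized group in the same `match` call would be stored in `m[2]`. A portable sketch that avoids the gawk extension is to turn every run of non-digits into a space and let awk's normal field splitting pick out the numbers. The sample line is copied from the question; `both_numbers` is a hypothetical helper name.

```shell
# Print the first and second number found on each "Track number" line.
both_numbers() {
  awk '/Track number: / {
    gsub(/[^0-9]+/, " ")   # rewrite $0 so that only the numbers remain as fields
    print $1, $2           # $1: track number, $2: track ID for mkvmerge & mkvextract
  }'
}

line='|  + Track number: 6 (track ID for mkvmerge & mkvextract: 5)'
result=$(printf '%s\n' "$line" | both_numbers)
echo "$result"
```

Printing only `$1` gives the first number of every matching line, which is what `grep -Eo [0-9]+ | head -1` could not do: `head -1` stops after the first line of the whole stream instead of keeping one number per track.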
