
I have the following PCRE2 regex, which matches the timestamp lines in a WebVTT (.vtt) subtitle file (the default for YouTube) so that they can be removed:

^[0-9].:[0-9].:[0-9].+$

This changes this:

00:00:00.126 --> 00:00:10.058
How are you today?

00:00:10.309 --> 00:00:19.272
Not bad, you?

00:00:19.559 --> 00:00:29.365
Been better.

To this:

How are you today?

Not bad, you?

Been better.

How would I convert this PCRE2 regex to an idiomatic (read: sane-looking) equivalent for sed's flavour of regex?

Hashim Aziz

2 Answers


Using your regex with sed

$ sed -En '/^[0-9].:[0-9].:[0-9].+$/!p' file
How are you today?

Not bad, you?

Been better.

Or, print only the lines that do not end with a digit

$ sed -n '/[0-9]$/!p' file
How are you today?

Not bad, you?

Been better.
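
One caveat worth checking with the second command: it also drops any subtitle text line that happens to end in a digit. The following is a minimal sketch, assuming a hypothetical cue ("Call me at 5") that is not in the original data, comparing the loose pattern with the anchored timestamp pattern.

```shell
# Hypothetical sample: the last cue's text ends in a digit.
sample='00:00:00.126 --> 00:00:10.058
How are you today?

00:00:10.309 --> 00:00:19.272
Call me at 5'

# Loose pattern: "Call me at 5" is removed along with the timestamp lines.
loose=$(printf '%s\n' "$sample" | sed -n '/[0-9]$/!p')

# Anchored timestamp pattern: only the timestamp lines are removed.
strict=$(printf '%s\n' "$sample" | sed -En '/^[0-9].:[0-9].:[0-9].+$/!p')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

If the subtitle text can end in numbers, the anchored pattern is probably the safer choice.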
HatLess
  • The first one gives me a "no such file or directory" error when I try and use it in the following command: `sed -e '1,4d' -En '/^[0-9].:[0-9].:[0-9].+$/!p' "input file.vtt" > test.txt`. – Hashim Aziz May 13 '22 at 23:22
  • I replaced it to make the command clearer - using tab completion for the file on my shell still results in the error. – Hashim Aziz May 13 '22 at 23:25
  • Could it be something to do with trying to combine `-e` and `-E`? I was previously using two invocations of `-e` in the same `sed` command but maybe `-E` is not a drop-in replacement? – Hashim Aziz May 13 '22 at 23:26
  • This is the full command that gives me the "no such file or directory" error: `sed -e '1,4d' -En '/^[0-9].:[0-9].:[0-9].+$/!p' "my file.vtt" > test.txt`. – Hashim Aziz May 13 '22 at 23:28
  • @HashimAziz Does this work `sed -En '1,4d;/^[0-9].:[0-9].:[0-9].+$/!p' "my file.vtt" > test.txt` – HatLess May 13 '22 at 23:33
  • 1
    Yes, that works perfectly, as does adding the spaces back between `/ ! p`, so I suppose I was right in my hunch that -e and -E can't be combined, which you solved by integrating the first command into the -E. – Hashim Aziz May 13 '22 at 23:38
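
A note on the error discussed in these comments: `-e` and `-E` can be combined. Once any `-e` is given, though, sed takes every script fragment from `-e` (or `-f`) and treats the remaining bare arguments as input file names. That is the likely source of the "no such file or directory" message, since the second script was passed without `-e`. The following is a minimal sketch, assuming a hypothetical four-line header in a temporary file standing in for "my file.vtt".

```shell
# Hypothetical sample file; the first four lines play the role of the header
# that the 1,4d command removes.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
WEBVTT
Kind: captions
Language: en

00:00:00.126 --> 00:00:10.058
How are you today?
EOF

# -E and -n are ordinary options; each script fragment gets its own -e.
result=$(sed -E -n -e '1,4d' -e '/^[0-9].:[0-9].:[0-9].+$/!p' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Joining the two commands with `;` inside a single script, as suggested above, sidesteps the issue in the same way.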

Your pattern is not specific to PCRE2. The only change needed for sed's default (basic) regex syntax is to escape the plus sign: in GNU sed, \+ makes it a quantifier for one or more repetitions.
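
The difference between the syntaxes can be checked on a single timestamp line. This is a minimal sketch, assuming GNU sed (`\+` in basic regex is a GNU extension); the variable names are only illustrative.

```shell
line='00:00:00.126 --> 00:00:10.058'

# Basic regex (sed's default): a bare + is a literal plus sign, so the
# pattern does not match and the line is printed.
bre_plain=$(printf '%s\n' "$line" | sed -n '/^[0-9].:[0-9].:[0-9].+$/!p')

# Basic regex with \+ (GNU extension): one-or-more quantifier, so the
# line matches and nothing is printed.
bre_escaped=$(printf '%s\n' "$line" | sed -n '/^[0-9].:[0-9].:[0-9].\+$/!p')

# Extended regex (-E): a bare + is already the quantifier.
ere=$(printf '%s\n' "$line" | sed -En '/^[0-9].:[0-9].:[0-9].+$/!p')

printf 'bre_plain=[%s]\nbre_escaped=[%s]\nere=[%s]\n' \
  "$bre_plain" "$bre_escaped" "$ere"
```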

Judging by the example data, every position where you use a dot to match any character actually holds a digit.

You could make the pattern a bit more specific and omit the quantifier altogether. Then simply prevent the line from being printed when the pattern matches.

sed -n '/^[0-9][0-9]:[0-9][0-9]:[0-9]/!p' file
  • -n prevents the default printing in sed
  • !p prints the line if the pattern does not match

Output

How are you today?

Not bad, you?

Been better.
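
An equivalent form, not part of the answer above but a common sed idiom, deletes the matching lines with the `d` command instead of combining `-n` with `!p`. A minimal sketch on a shortened copy of the sample data:

```shell
sample='00:00:00.126 --> 00:00:10.058
How are you today?

00:00:10.309 --> 00:00:19.272
Not bad, you?'

# Print everything except lines that match (the form used in the answer).
with_not_p=$(printf '%s\n' "$sample" | sed -n '/^[0-9][0-9]:[0-9][0-9]:[0-9]/!p')

# Delete lines that match; default printing takes care of the rest.
with_d=$(printf '%s\n' "$sample" | sed '/^[0-9][0-9]:[0-9][0-9]:[0-9]/d')

printf '%s\n' "$with_d"
```

Both commands should produce the same output; which one reads better is largely a matter of taste.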
The fourth bird