
As the title states, I want to match the timestamp and text lines of .srt subtitle files.

Some of these files are not formatted properly, so I need something that works with almost all of them.

A correctly formatted file looks like this:

1
00:00:02,160 --> 00:00:04,994
You really don't remember
what happened last year?

2
00:00:06,440 --> 00:00:07,920
- School. Now.
- I dropped out.

3
00:00:08,120 --> 00:00:10,510
- Get your diploma, I'll get mine.
- What you doing?

4
00:00:10,680 --> 00:00:13,514
- Studying.
- You taking your GED? All right, Fi.

The regex pattern that I came up with works very well for files like this.

As I said, some of the files are not formatted properly: some don't have the line number, and some don't have an empty line after each subtitle. The regex that I came up with does not work properly for those.

There are other questions like this that have already been answered, but I want to match each timestamp and the text lines in separate capturing groups. So my groups for the first subtitle of the example above would be something like this:

group 1: 00:00:02,160

group 2: 00:00:04,994

group 3: You really don't remember\nwhat happened last year?

This is what I've got so far:

LINE_RE = (
    # group 1: optional leading whitespace,
    # followed by a time format like 00:00:00,000
    r"^\s*(\d+:\d+:\d+,\d+)"
    # non-capturing group for ' --> ':
    # matches two or three '-' followed by a '>'
    r"(?:\s*-{2,3}>\s*)"
    # group 2: time format again,
    # ended with any amount of whitespace and a \n
    r"(\d+:\d+:\d+,\d+)\s*\n"
    # group 3: matches any character, until it hits an empty line,
    # a timestamp, or a line with only a number in it
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))"
)

I think my exact problem is in the last non-capturing group: it does not work properly when the next line is not an empty line.

This is an example file. I mangled it a bit so I could show the problem better.
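
The symptom can be reproduced with a small sketch. It assumes the pattern is compiled with `re.MULTILINE` (the flags are not shown above), and the two-cue input is made up for illustration. Because the terminating alternatives sit inside group 3, they are consumed as part of the match, so the start timestamp of the next cue is swallowed and that cue is never matched:

```python
import re

# The pattern from the question, compiled with re.MULTILINE.
# The flag is an assumption: the question does not show which flags were used.
LINE_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)"
    r"(?:\s*-{2,3}>\s*)"
    r"(\d+:\d+:\d+,\d+)\s*\n"
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))",
    re.MULTILINE,
)

# Hypothetical malformed input: no index lines, no blank line between cues.
malformed = (
    "00:00:01,000 --> 00:00:02,000\n"
    "first\n"
    "00:00:03,000 --> 00:00:04,000\n"
    "second\n"
)

matches = list(LINE_RE.finditer(malformed))
for m in matches:
    # Group 3 of the only match ends with the next cue's start timestamp.
    print(m.groups())
```

Only one of the two cues is found here, which suggests the terminator should be checked with a lookahead rather than matched.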

sina.E

1 Answer


In that case, you can match the lines that start with a timestamp-like pattern, and capture all following lines, stopping before a line that starts with another timestamp-like pattern or before one or more newlines followed by a line containing only digits.

^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)

The pattern in parts matches:

  • ^ Start of string, or start of each line when the multiline flag is enabled
  • \s* Match optional whitespace chars
  • (\d+:\d+:\d+,\d+) Capture group 1, match a timestamp-like pattern
  • [^\S\n]+-->[^\S\n]+ Match --> between 1 or more whitespace chars other than a newline
  • (\d+:\d+:\d+,\d+) Capture group 2, same pattern as for group 1
  • ( Capture group 3
    • (?: Non-capturing group
      • \n Match a newline
      • (?! Negative lookahead, assert what is to the right is not
        • \d+:\d+:\d+,\d+\b|\n+\d+$ Match either a timestamp or 1+ newlines followed by only digits
      • ) Close lookahead
      • .* Match the whole line
    • )* Close the non-capturing group and optionally repeat it
  • ) Close group 3

See a regex demo.
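
As a rough sketch of how the pattern might be used from Python: the function name `parse_cues` and the sample text are made up for illustration, and `re.MULTILINE` is assumed so that `^` and `$` match at line boundaries. Group 3 keeps its leading newline, so the text is stripped after matching.

```python
import re

# The pattern from this answer. re.MULTILINE is assumed so that ^ and $
# match at the start and end of each line.
CUE_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)"
    r"((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)",
    re.MULTILINE,
)


def parse_cues(srt_text):
    """Return a list of (start, end, text) tuples (hypothetical helper)."""
    cues = []
    for m in CUE_RE.finditer(srt_text):
        start, end, text = m.groups()
        # Group 3 begins with the newline after the timestamp line.
        cues.append((start, end, text.strip()))
    return cues


# Made-up sample: only the first cue has an index line, and the last two
# cues are not separated by a blank line.
sample = (
    "1\n"
    "00:00:02,160 --> 00:00:04,994\n"
    "You really don't remember\n"
    "what happened last year?\n"
    "\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "- School. Now.\n"
    "- I dropped out.\n"
    "00:00:08,120 --> 00:00:10,510\n"
    "- Get your diploma, I'll get mine.\n"
)

for cue in parse_cues(sample):
    print(cue)
```

Note that an index line that directly follows the text without a blank line in between is not excluded by the lookahead, so it would end up in group 3.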

The fourth bird
  • It's working all right, but sometimes it captures an empty line before the text (group 3). It's not a problem and I can work around it, but since I'm trying to get better at regex, I'm curious why? – sina.E Mar 23 '22 at 10:28
  • @sina.E That is because the repeating group inside capture group 3 starts with `(?:\n`, so the leading newline is matched inside the repetition. – The fourth bird Mar 23 '22 at 10:31
  • @sina.E You might also use a pattern like https://regex101.com/r/c3DQWs/1 – The fourth bird Mar 23 '22 at 10:34
  • Thanks a lot for your answer, since most of the files that I'm working with are badly formatted, I need to mind as many edge cases as possible. – sina.E Mar 23 '22 at 10:39
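
One possible way to keep the leading newline out of group 3 is to match it before the group opens. The sketch below is an illustration only, not necessarily the pattern behind the link in the comments. It assumes `re.MULTILINE`, that every cue has at least one non-empty text line, and that a blank line ends the cue; the names `TS` and `CUE_RE` and the sample text are made up.

```python
import re

TS = r"\d+:\d+:\d+,\d+"  # timestamp-like pattern, e.g. 00:00:02,160

# The newline after the second timestamp is consumed before group 3 opens,
# so group 3 starts directly at the first text line. Each text line must be
# non-empty (.+) and must not start with a timestamp.
CUE_RE = re.compile(
    rf"^\s*({TS})[^\S\n]+-->[^\S\n]+({TS})[^\S\n]*\n"
    rf"((?!{TS}\b).+(?:\n(?!{TS}\b).+)*)",
    re.MULTILINE,
)

# Made-up sample: no index lines, and no blank line before the last cue.
sample = (
    "00:00:02,160 --> 00:00:04,994\n"
    "You really don't remember\n"
    "what happened last year?\n"
    "\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "- School. Now.\n"
    "- I dropped out.\n"
    "00:00:08,120 --> 00:00:10,510\n"
    "- Get your diploma, I'll get mine.\n"
)

texts = [m.group(3) for m in CUE_RE.finditer(sample)]
for text in texts:
    print(repr(text))
```

The trade-off is that a cue with no text at all is skipped by this variant, whereas the pattern in the answer would still match it with an empty group 3.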