
I'm using SubtitleEdit and I'd like to locate all the lines that do not contain a line break.

A line that contains a line break is bilingual, and those are the lines I want to keep.

But those that do not have line breaks are mono-lingual, and I'd like to quickly locate them all and delete them. TIA!

Alternatively, if there is a regular expression that can find lines which do not contain any English characters, that would also work.

Joe
  • What do you mean by 'line break'? `\r\n` sequence or html `<br>` or what? – Poul Bak Oct 25 '22 at 12:01
  • Sorry should've clarified, `<br>` – Joe Oct 25 '22 at 12:04
  • Give some examples of input and what should match and what should not match. How do you define 'a line'? – Poul Bak Oct 25 '22 at 15:05
  • Please take a look at the screenshot in my comment below, thank you. – Joe Oct 25 '22 at 17:25
  • Come on, look at the text, it contains `<br />` NOT `<br>`! The space and the slash do of course also matter. – Poul Bak Oct 25 '22 at 19:50
  • I don't think it does though, because what it shows there is artificial: https://i.imgur.com/KlckO6T.png I found out that it doesn't even matter what I put in between the < >, the result is the same ➜ https://i.imgur.com/VkXsDom.png ➜ https://i.imgur.com/N8iS1oB.png – Joe Oct 26 '22 at 04:09
  • There is some info here about new line in SubtitleEdit, but it's a lil beyond me... https://github.com/SubtitleEdit/subtitleedit/issues/3221 – Joe Oct 26 '22 at 04:19
  • Ok, weird newline handling by that editor, but having read that Github article, try: `(?=.*\r?\n)` – Poul Bak Oct 26 '22 at 11:20
  • Sorry, I meant: `(?!.*\r?\n).*` – Poul Bak Oct 26 '22 at 16:31
  • That's actually pretty close. It removes Line 2 which is great. But it's also removing the part of Line 1 that is after the newline. https://i.imgur.com/g3BHAsH.png Basically I'm wanting to remove all the lines which are monolingual. And keep the bilingual lines as they are. Since all the bilingual lines have newline, that's what is differentiating them. – Joe Oct 26 '22 at 19:11
  • BTW, is there a regex expression that can find lines which do not contain any English characters? That would also work to quickly identify all the monolingual lines (assuming the monolingual lines are in Chinese). – Joe Oct 26 '22 at 19:12

2 Answers


You can use a regex lookahead assertion. Given these test lines:

something_1
some<br>thing_2
something_3<br>
<br>something_4
something_5

This expression matches lines 1 and 5:

^(?!.*<br>).*$

The negative lookahead assertion (?!.*<br>) rejects any line that contains <br> anywhere, so only the lines without it are matched.
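As a rough check of the expression above, here is a minimal sketch using Python's `re` module. Python is only a stand-in here (SubtitleEdit uses its own regex engine), and the five test lines are joined with a plain `\n`, which is an assumption about the input rather than anything SubtitleEdit does.

```python
import re

# The five test lines from the answer, joined with plain "\n".
text = "\n".join([
    "something_1",
    "some<br>thing_2",
    "something_3<br>",
    "<br>something_4",
    "something_5",
])

# re.MULTILINE makes ^ and $ anchor at every line, not only at the
# start and end of the whole text.
no_br = re.compile(r"^(?!.*<br>).*$", re.MULTILINE)

# Collect every line that does not contain the literal text "<br>".
matches = no_br.findall(text)
print(matches)
```

Only lines 1 and 5 come back, because the lookahead fails on every line where `<br>` appears at any position.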

Vitalii
  • I tried using ^(?!.*<br>).*$ but it finds everything that is not <br>. I only want it to find the lines which don't contain <br>. A screenshot to make it clearer what I mean: https://i.imgur.com/XsD72Se.png So here I need it to find LINE 4 and replace it with nothing. But instead it's sort of doing the opposite. – Joe Oct 25 '22 at 17:17
  • Let me correct myself, it IS indeed working for LINE 4, because as you can see in the "after" column, it becomes blank. Very good. But the problem is that it's also replacing all the texts in all the other lines (sans <br>), but I don't want that. – Joe Oct 25 '22 at 17:24
  • Did you try **^(?!.*<br />).*$** expression? – Vitalii Oct 25 '22 at 21:11
  • Yes I did, same result. I think it is because what it shows there as new line is artificial: https://i.imgur.com/KlckO6T.png I found out that it doesn't even matter what I put in between the < >, the result is the same ➜ https://i.imgur.com/VkXsDom.png ➜ https://i.imgur.com/N8iS1oB.png – Joe Oct 26 '22 at 04:12
  • There is some info here about new line in SubtitleEdit, but it's a lil beyond me... https://github.com/SubtitleEdit/subtitleedit/issues/3221 – Joe Oct 26 '22 at 04:19

The confusion here was caused by 2 facts:

  1. What SubtitleEdit calls a line is actually multiline text that can contain newlines.
  2. The newline displayed is not the one used internally (so it would never match <br>).

Solution 1:

Now that we have found out it uses either \r\n or just \n, we can write a regex:

(?-m)^(?!.*\r?\n)[\s\S]*$

Explanation:

(?-m) - turn off the multiline option (which is otherwise enabled).

^ - match from start of text

(?!.*\r?\n) - negative lookahead for zero or more of any character followed by a newline (\r\n or \n); it fails if the text contains a newline.

[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.

In short: If we don't find newline characters, match everything.

Now replace with an empty string.
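To see the effect of Solution 1 outside the editor, here is a small sketch in Python. It is only an approximation: Python's `re` is not multiline by default, so the `(?-m)` prefix is left out, and the sample entries and the helper name `blank_monolingual` are made up for illustration.

```python
import re

# Same pattern as Solution 1, minus "(?-m)": without re.MULTILINE,
# ^ and $ already anchor to the whole text in Python.
NO_NEWLINE = re.compile(r"^(?!.*\r?\n)[\s\S]*$")

def blank_monolingual(entry: str) -> str:
    """Return "" for an entry with no newline, else the entry unchanged."""
    return NO_NEWLINE.sub("", entry)

# Hypothetical subtitle entries: two bilingual (with \r\n or \n), one not.
entries = ["你好\r\nHello", "再见\nGoodbye", "只有中文"]
cleaned = [blank_monolingual(e) for e in entries]
print(cleaned)
```

The two entries that contain a newline pass through untouched, and the single-line entry is replaced by an empty string.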

Solution 2:

If you want to match lines that don't contain any English characters, you can use this:

(?-m)^(?![\s\S]*[a-zA-Z])[\s\S]*$

Explanation:

(?-m) - turn off the multiline option (which is otherwise enabled).

^ - match from start of text

(?![\s\S]*[a-zA-Z]) - negative lookahead for ANY characters (including newlines) followed by an English letter; it fails if the text contains a letter from a to z or A to Z.

[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.

In short: If we don't find an English character, match everything.

Now replace with an empty string.
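Solution 2 can be sketched the same way in Python. Again this is only an illustration: the `(?-m)` prefix is dropped because Python's `re` is not multiline by default, and the sample strings and the helper name `has_no_english_letters` are hypothetical. Note that `[a-zA-Z]` covers ASCII letters only, so digits and punctuation do not count as "English characters" here.

```python
import re

# Same pattern as Solution 2, minus "(?-m)" (not needed in Python).
NO_LATIN = re.compile(r"^(?![\s\S]*[a-zA-Z])[\s\S]*$")

def has_no_english_letters(entry: str) -> bool:
    """True when the whole entry contains no ASCII letter a-z or A-Z."""
    return NO_LATIN.search(entry) is not None

# Hypothetical entries: bilingual, Chinese only, Chinese with digits, English only.
samples = ["你好\nHello", "只有中文", "第 12 集!", "OK"]
flags = [has_no_english_letters(s) for s in samples]
print(flags)
```

Because the lookahead uses `[\s\S]*` instead of `.*`, it scans across the newline and still finds the English text on the second line of a bilingual entry.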

Poul Bak
  • Both regex worked perfect as described. Thanks so much! – Joe Oct 28 '22 at 22:24
  • Sorry, could you give the regex for "If we find ONLY English characters, match everything." (including symbols like ! or ? that are part of normal English sentences) – Joe Nov 14 '22 at 16:22