I need a regex that will select everything from the beginning of the line until the first left square bracket. In the example below, it would match "Lorem", "consectetur-adipisicing" and "labore et", and nothing on the line that has no bracket.

Lorem [ipsum] dolor sit amet,

consectetur-adipisicing [elit] sed do 

eiusmod tempor incididunt ut

labore et [dolore] magna aliqua.

Thank you for the help.

5 Answers

Using look-behind and look-ahead:

(?<=^|\n)(.*?)(?=\s?\[)

Explanation:

(?<=...) is positive look-behind, checking that the previous characters match.

^|\n is intended to match the start of a line: either the start of the text (^) or the position right after a new-line (\n).

. is any character.

.*? is zero or more of any character. Using *? instead of * makes the match non-greedy, so it stops at the first bracket rather than the last.

(?=...) is positive look-ahead, checking that the next characters match.

\s is white-space; the ? makes it optional (this prevents the space before the [ from being included in the match).

\[ is an escaped [ (it needs to be escaped since an unescaped [ starts a character class).
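
As a quick check, the pattern can be tried in Python on the sample text from the question. This is only a sketch under one assumption: Python's re module does not accept (?<=^|\n), because the two look-behind alternatives have different widths, so the start-of-line test is swapped for ^ in multi-line mode, which is what the look-behind is meant to emulate. The names TEXT, PATTERN and matches are illustrative.

```python
import re

# Sample text from the question (one line has no "[" at all).
TEXT = (
    "Lorem [ipsum] dolor sit amet,\n"
    "consectetur-adipisicing [elit] sed do\n"
    "eiusmod tempor incididunt ut\n"
    "labore et [dolore] magna aliqua."
)

# (?m) makes ^ match at every line start. It stands in for (?<=^|\n),
# which Python's re rejects as a variable-width look-behind.
PATTERN = re.compile(r"(?m)^(.*?)(?=\s?\[)")

matches = [m.group(1) for m in PATTERN.finditer(TEXT)]
print(matches)  # ['Lorem', 'consectetur-adipisicing', 'labore et']
```

The line without a bracket is skipped because . does not cross the new-line by default, so the look-ahead never finds a [ on that line.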

Bernhard Barker
  • Great answer @Dukeling, very detailed. I'd add \s before \[ just to skip the white space. Btw, this matches **eiusmod tempor incididunt ut**, which is the text from the line which doesn't have the **[** char in it. – Rolando Isidoro Apr 24 '13 at 10:03
  • @RolandoIsidoro Added the `\s`. Are you sure it matches the line without the `[`? This should be prevented with the look-ahead (`(?=\s?\[)`). – Bernhard Barker Apr 24 '13 at 10:14
  • @RolandoIsidoro Untick "DOT ALL", then it works as required. Another option is to change the `.` to exclude new-lines (probably something like `[^\n]`). – Bernhard Barker Apr 24 '13 at 11:02
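
The "DOT ALL" effect discussed in these comments can be reproduced with a small Python sketch. It assumes (?m)^ style anchoring via re.MULTILINE in place of the look-behind (Python's re does not accept the original one) and compares the same pattern with and without re.DOTALL, plus the [^\n] alternative suggested above. All variable names are illustrative.

```python
import re

TEXT = (
    "Lorem [ipsum] dolor sit amet,\n"
    "consectetur-adipisicing [elit] sed do\n"
    "eiusmod tempor incididunt ut\n"
    "labore et [dolore] magna aliqua."
)

DOT_PATTERN = r"^(.*?)(?=\s?\[)"
NO_NEWLINE_PATTERN = r"^([^\n]*?)(?=\s?\[)"

# Without DOTALL, "." stops at the end of each line.
without_dotall = [m.group(1) for m in re.finditer(DOT_PATTERN, TEXT, re.MULTILINE)]

# With DOTALL, "." also matches "\n", so the bracket-less line leaks
# into the match by running on to the "[" of the following line.
with_dotall = [
    m.group(1) for m in re.finditer(DOT_PATTERN, TEXT, re.MULTILINE | re.DOTALL)
]

# [^\n] can never cross a line break, so DOTALL no longer matters.
no_newline = [
    m.group(1)
    for m in re.finditer(NO_NEWLINE_PATTERN, TEXT, re.MULTILINE | re.DOTALL)
]

print(without_dotall)
print(with_dotall)
print(no_newline)
```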

Why do people use the dot and complicated lookaround constructs when a simple anchor and negated character class will do the trick?

(?m)^[^\[\r\n]+(?=\[)

If your regex flavor supports it, you can further optimize this regex by making the quantifier possessive:

(?m)^[^\[\r\n]++(?=\[)

If your regex flavor doesn't support lookahead, include the [ in the match and use a capturing group to get the part that you want:

(?m)^([^\[\r\n]+)\[

If your regex flavor doesn't support mode modifiers like (?m), simply turn on the option to make ^ match at line breaks ("multi-line mode") outside the regex.
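
A minimal Python sketch of three of the variants above, run against the sample text from the question: the look-ahead form, the capturing-group form, and the look-ahead form with multi-line mode switched on outside the regex (re.MULTILINE is Python's name for that option). The possessive form is left out of the sketch. Variable names are illustrative.

```python
import re

TEXT = (
    "Lorem [ipsum] dolor sit amet,\n"
    "consectetur-adipisicing [elit] sed do\n"
    "eiusmod tempor incididunt ut\n"
    "labore et [dolore] magna aliqua."
)

# Look-ahead variant: the "[" must follow but is not part of the match.
lookahead = re.findall(r"(?m)^[^\[\r\n]+(?=\[)", TEXT)

# Capturing-group variant: the "[" is consumed, group 1 holds the prefix.
captured = re.findall(r"(?m)^([^\[\r\n]+)\[", TEXT)

# Same look-ahead pattern, with multi-line mode set outside the regex.
external_flag = re.findall(r"^[^\[\r\n]+(?=\[)", TEXT, re.MULTILINE)

print(lookahead)  # ['Lorem ', 'consectetur-adipisicing ', 'labore et ']
```

Note that the space before the [ is part of each match, because the negated class only excludes [ and line breaks; strip the result if that space is unwanted.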

Jan Goyvaerts
  • You are correct that no lookbehind assertion is needed at the start, but the one at the end is needed to ensure that there is a `[` following the pattern (your regex matches "eiusmod tempor incididunt ut" wrongly). Also you need the opening square bracket in your expression, not the closing one. I think you meant `(?m)^[^\[\r\n]+(?=\[)`. [Regexr](http://regexr.com?35jtr) – stema Jul 17 '13 at 06:44
  • If lines without any `[` must not be matched, and the `[` must not be included in the match, then you need a lookahead. – Jan Goyvaerts Jul 17 '13 at 07:30

Try "[^\[]*". Here [] denotes a character set, ^\[ means any character except [, and * repeats it any number of times. Combined, this should be your answer.
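
To see how this behaves, here is a short Python sketch that applies the pattern to each sample line separately with re.match, which anchors at the start of the string. Processing line by line is an assumption, since the answer does not say how the pattern is applied, and LINES and prefixes are illustrative names.

```python
import re

LINES = [
    "Lorem [ipsum] dolor sit amet,",
    "consectetur-adipisicing [elit] sed do",
    "eiusmod tempor incididunt ut",
    "labore et [dolore] magna aliqua.",
]

# [^\[]* can match the empty string, so re.match always succeeds here.
prefixes = [re.match(r"[^\[]*", line).group(0) for line in LINES]
print(prefixes)
```

Two things stand out in the result: the space before the [ is kept, and a line without any [ is returned in full, since nothing in the pattern requires a bracket to follow.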

abasu

I would say the simplest version would be:

(.*?)\[.*

Salgar
  • Thanks Salgar but it matches the whole line, and not only the words preceding the bracket. When I remove .* and apply (.*?)\[ it does what I asked for, except it includes the bracket in the match, and it shouldn't – TotoKalvera Apr 24 '13 at 09:46
  • Ah sorry, I assumed you wanted a group match of the initial part. Go with what abasu said in that case. – Salgar Apr 24 '13 at 09:48
  • Dukeling's expression works like a charm, problem solved, thank you Salgar. – TotoKalvera Apr 24 '13 at 10:04

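This might be helpful:

^(.*)\[

Simple example:

use strict;
use warnings;

my $str = "consectetur-adipisicing [elit] sed do";
my $tmp = '';
# Capture everything from the start of the string up to the "[".
# Note that .* is greedy, so with several brackets it runs to the last one.
if ($str =~ m/^(.*)\[/) {
    $tmp = $1;
}
print "String up to [: $tmp\n";

The output is:

String up to [: consectetur-adipisicing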
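
Since the question asks for the text up to the first bracket, it may help to compare the greedy .* with the non-greedy .*? on a line that contains more than one [. The sketch below is in Python and uses a made-up line (not taken from the question) purely for illustration.

```python
import re

# Hypothetical line with two brackets, chosen to expose the difference.
line = "labore [et] dolore [magna] aliqua"

# Greedy: .* grabs as much as possible, then backtracks to the LAST "[".
greedy = re.match(r"^(.*)\[", line).group(1)

# Non-greedy: .*? grows one character at a time and stops at the FIRST "[".
lazy = re.match(r"^(.*?)\[", line).group(1)

print(repr(greedy))  # 'labore [et] dolore '
print(repr(lazy))    # 'labore '
```

On lines with a single bracket, such as the one in the example above, both forms capture the same text.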
Suvasish Sarker