If I have a large string with multiple lines and I want to match part of a line only to end of that line, what is the best way to do that?

So, for example I have something like this and I want it to stop matching when it reaches the new line character.

r"(?P<name>[A-Za-z\s.]+)"

I saw this in a previous answer:

$ - indicates matching to the end of the string, or end of a line if multiline is enabled.

My question is then how do you "enable multiline" as the author of that answer states?

3 Answers

Simply use

r"(?P<name>[A-Za-z\t .]+)"

This will match ASCII letters, spaces, tabs or periods. It'll stop at the first character that's not in the character class, and newlines aren't (whereas \s does include them). Because the class can never match a newline, it's irrelevant whether multiline mode is turned on or off.
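As a rough illustration of the difference, here is a minimal sketch comparing the two character classes. The sample text and variable names are made up for the example and are not from the question.

```python
import re

# Hypothetical sample input: two lines, each ending in a newline.
text = "John A. Smith\nJane Doe\n"

# \s includes "\n", so the class can run across line boundaries.
with_s = re.match(r"(?P<name>[A-Za-z\s.]+)", text)

# Only letters, tab, space and period: the match stops at the first "\n".
without_newline = re.match(r"(?P<name>[A-Za-z\t .]+)", text)

print(repr(with_s.group("name")))           # 'John A. Smith\nJane Doe\n'
print(repr(without_newline.group("name")))  # 'John A. Smith'
```

No flags are passed in either call, which is the point: the second pattern stops at the line break on its own.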

Tim Pietzcker

You can enable multiline matching by passing re.MULTILINE as the second argument to re.compile(). However, there is a subtlety to watch out for: since the + quantifier is greedy, this regular expression will match as long a string as possible, so if the next line is also made up of letters and whitespace, the regex might match more than one line (in multiline mode $ matches at the end of any line, so a match that runs across several lines still satisfies it).

There are three solutions to this:

  1. Change your regex so that, instead of matching any whitespace including newline (\s) your repeated character set does not match that newline.
  2. Change the quantifier to +?, the non-greedy ("minimal") version of +, so that it will match as short a string as possible and therefore stop at the first newline.
  3. Change your code to first split the text up into an individual string for each line (using text.split('\n')).
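A minimal sketch of the greedy problem and of solutions 1 and 3, assuming a made-up two-line input and a `$` anchor appended to the pattern from the question (the variable names are illustrative only):

```python
import re

# Hypothetical sample input with two lines of names.
text = "Alice B. Cooper\nBob Dylan"

# The problem: \s matches "\n", so the greedy + runs over both lines
# and $ is satisfied at the end of the second one.
greedy = re.compile(r"(?P<name>[A-Za-z\s.]+)$", re.MULTILINE)
first = greedy.search(text).group("name")
print(repr(first))  # 'Alice B. Cooper\nBob Dylan'

# Solution 1: a character class that cannot match a newline.
per_line = re.compile(r"(?P<name>[A-Za-z .]+)$", re.MULTILINE)
solution1 = per_line.findall(text)
print(solution1)  # ['Alice B. Cooper', 'Bob Dylan']

# Solution 3: split into lines first, then match each line separately.
line_pattern = re.compile(r"(?P<name>[A-Za-z\s.]+)$")
solution3 = []
for line in text.split("\n"):
    m = line_pattern.match(line)
    if m:
        solution3.append(m.group("name"))
print(solution3)  # ['Alice B. Cooper', 'Bob Dylan']
```

With solution 3 the original \s class is harmless, because no individual line contains a newline any more.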
azernik
  • Thanks for the solutions! The first one sounds the easiest to implement. Do you know specifically how I can specify I only want single spaces to be matched as opposed to any whitespace? I tried the second solution but it only matches a single character. –  Sep 09 '11 at 20:02
  • My bad, should have mentioned - for all these solutions, you should also include the `$` anchor (end of string, or end of a line in multiline mode) at the end. That way, with solution 2, `re` will find the shortest string that matches the regex *and* goes up to the end of a line, which is what you want. For solution 1, a space can be represented in a character set by a literal space - no escaping required (i.e. `[A-Za-z .]`) – azernik Sep 09 '11 at 21:56
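To illustrate why the non-greedy version matched only one character, and how the `$` anchor changes that, here is a small sketch. The sample text is hypothetical.

```python
import re

# Hypothetical sample input with two lines of names.
text = "Alice B. Cooper\nBob Dylan"

# Without an anchor, the lazy quantifier is satisfied by a single character.
unanchored = re.search(r"(?P<name>[A-Za-z\s.]+?)", text).group("name")
print(repr(unanchored))  # 'A'

# With $ in MULTILINE mode, the match grows only until the first line end.
anchored = re.search(r"(?P<name>[A-Za-z\s.]+?)$", text, re.MULTILINE).group("name")
print(repr(anchored))  # 'Alice B. Cooper'
```

Note that the class still contains \s here, so a later match could begin with the newline itself; excluding the newline from the class avoids that.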

Look at the flags parameter at http://docs.python.org/library/re.html#module-contents
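For reference, a short sketch of the usual ways to pass the flag with the standard `re` module. The sample text and the `\w+$` pattern are chosen only for the demonstration.

```python
import re

# Hypothetical sample input: two lines, no trailing newline.
text = "one two\nthree four"

# Default: $ matches only at the end of the whole string.
default = re.findall(r"\w+$", text)
print(default)  # ['four']

# Flag passed directly to a module-level function.
multiline = re.findall(r"\w+$", text, re.MULTILINE)
print(multiline)  # ['two', 'four']

# Flag passed as the second argument to re.compile().
compiled = re.compile(r"\w+$", re.MULTILINE).findall(text)

# Inline flag at the start of the pattern, equivalent to re.MULTILINE.
inline = re.findall(r"(?m)\w+$", text)
```

All three multiline variants should give the same result; which one to use is mostly a matter of taste.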

rocksportrocker