I've been trying to get this regex working in Python, but it isn't producing the expected result.

I load a text file whose contents are in the following format:

"18 75 19\n!dont split here\n! but split here\n* and split here"

I'd like to get the following output:

['18 75 19\n!dont split here',
 '! but split here',
 '* and split here']

I'm trying to split my string by either 1) a new line followed by a number, or 2) a new line followed by a special character only if it is followed by a space (e.g. '! but split here', but not '!dont split here').

Here's what I have so far:

re.split(u'\n(?=[0-9]|([`\-=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?])(?= ))', str)

This is close, but not there yet. Here's the output it produces:

['18 75 19\n!dont split here', '!', '! but split here', '*', '* and split here']

It incorrectly matches the special character separately: '!' and '*' have their own element. There are two lookahead operators in the regex.

I'd really appreciate it if you could help identify what I should change in this regex so that it no longer returns the single special character on its own, and only returns the special character together with the rest of its line.

I'm also open to alternatives. If there's a better way that doesn't involve two lookaheads, I'd also be interested to understand other ways to tackle this problem.

Thanks!

Rohan
  • Why should `'18 75 19\n!dont split here'` not split? Doesn't the new line character follow a number in that case? I get that there's no space after the `!`, but your first condition matches?: "either 1) a new line followed a number, or 2)" – Grismar Feb 03 '20 at 02:13
  • @Grismar I think it's supposed to read "a new line followed *by* a number" – Nick Feb 03 '20 at 02:15
  • That would make more sense, although the example data doesn't have a case of that - but I agree that would work. – Grismar Feb 03 '20 at 03:22
  • @Nick already provided a succinct and correct answer, just a remark: don't use `str` as a variable name, as it will shadow the Python type `str`, which will lead to all kinds of hard-to-find problems. Either use something like `s` or, if you insist on `str`, use `str_` instead. – Grismar Feb 03 '20 at 03:28

1 Answer

Your regex is actually working; the issue is the capturing group you have around [`\-=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?]. From the manual for re.split:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list

If you remove the () around that character class, you will get the results you expect.
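
As a minimal sketch of that behaviour, compare the two versions below. The variable name `text` is my own choice, and the special-character class is shortened to `[!*]` purely for readability (an assumption; your full class behaves the same way here):

```python
import re

text = "18 75 19\n!dont split here\n! but split here\n* and split here"

# Capturing group inside the lookahead: re.split also returns the
# text captured by the group, so '!' and '*' show up as extra items.
with_group = re.split(r'\n(?=[0-9]|([!*])(?= ))', text)

# Same pattern with the capturing parentheses removed.
without_group = re.split(r'\n(?=[0-9]|[!*](?= ))', text)

print(with_group)
print(without_group)
```

If you ever need parentheses for grouping inside a split pattern, a non-capturing group `(?:...)` avoids the extra list items as well.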

Note that you don't need (?= ) in that alternation, as it is already inside a lookahead; you can just use a literal space. You might also find it easier to write the symbols as a negated character class, i.e.

re.split(u'\n(?=[0-9]|[^A-Za-z0-9] )', str)
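
Here is a small runnable sketch of that pattern on the sample string from the question. The names `text`, `parts` and `indented` are hypothetical, and `text` is used instead of `str` so the built-in type is not shadowed:

```python
import re

text = "18 75 19\n!dont split here\n! but split here\n* and split here"

# Split on a newline that is followed by a digit, or by one
# non-alphanumeric character and then a space.
parts = re.split(r'\n(?=[0-9]|[^A-Za-z0-9] )', text)
print(parts)

# Caveat: the negated class also matches whitespace, so a line that
# starts with two spaces triggers a split as well.
indented = re.split(r'\n(?=[0-9]|[^A-Za-z0-9] )', "a\n  indented")
print(indented)
```

The negated class is broader than the explicit list of symbols in the question, so if your data can contain indented lines or other characters you don't want to split on, the explicit list may be the safer choice.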
Nick