Simplifying Regexp

Question

I have the following regexp (using Python syntax):

(\d+)x(\d+)(?:\s+)?-(?:\s+)?([^\(\)]+)(?:\s+)?\((\d+)(?:(?:\s+)?-(?:\s+)?([^\(\)]+))?\)(?:(?:\s+)?\(([^\(\)]+)\))?(?:(?:\s+)?-(?:\s+)?([^\(\)]+) \((\d+)\))?

It matches strings which fit one of the following forms:

21x04 - Some Text (04)
6x03 - Some Text (00 - Some Text)
6x03 - Some Text (00 - Some Text) (Some Text)
23x01 - Some Text (10) - Some Text (02)

The numbers and text vary, and are captured. However, the spacing is not always consistent, so it is designed to allow for any number of spaces.

Is there a way of simplifying it? I'm not necessarily asking for someone to do this for me, just to tell me whether there are tools (a Google search yielded a few results, but none of them could handle it) or a systematic method of doing this.

Or can anyone see a better regex that fits this scenario?

What is the regex supposed to do? Are the capture groups important, or did you simply add brackets? — Willem Van Onsem, Apr 08 '15 at 22:35
The capture groups are important. Where they are not, I've used (?: ). I'm adding a description. — p0llard, Apr 08 '15 at 22:36
Requests for tools are off-topic. *"Is there a way of simplifying it"* feels a little too broad. What is this regex actually **for**? — jonrsharpe, Apr 08 '15 at 22:38
For one simplification, note that you can replace e.g. `(?:\s+)?` (one or more whitespace characters, optionally) with `\s*` (zero or more whitespace characters) — jonrsharpe, Apr 08 '15 at 22:42
@jonrsharpe I have a lot of filenames of videos/music (from a backup) which are in one of the formats illustrated in the posts, with inconsistent spacing. I'm trying to extract data from the file names so that I can categorise them again. I can't see any other way of doing so (ie. not using regex) as they don't have any metadata associated with them. — p0llard, Apr 08 '15 at 22:44
I would say the most important first step would be to put the regex into [verbose](https://docs.python.org/2/library/re.html#re.X) form, with comments, so it's actually readable. — user2357112, Apr 08 '15 at 22:47
You could also use named capture groups to clarify what's going on (e.g. `(?P\d+)x(?P\d+)`). What metadata are you trying to extract - what could be in each group? — jonrsharpe, Apr 08 '15 at 22:48
Thanks. What is your goal with the simplification - make it more readable, shorter, faster, ...? — jonrsharpe, Apr 08 '15 at 22:56
@jonrsharpe More readable is the main goal. Whilst it works, I don't really like it being so messy. — p0llard, Apr 08 '15 at 22:58
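
As a rough illustration of the suggestions in the comments above (verbose mode, `\s*` in place of `(?:\s+)?`, and named groups), the pattern could be laid out as in the sketch below. The group names (`first`, `second`, `title`, `code`, `inner`, `extra`, `title2`, `code2`) and the `parse` helper are hypothetical, since the thread never says what each field means. The sketch also makes the text groups lazy so that trailing whitespace is not captured, and allows any whitespace before the final parenthesis, so it is slightly more lenient than the original rather than an exact equivalent.

```python
import re

# Sketch only: the group names are placeholders, not taken from the question.
PATTERN = re.compile(r"""
    (?P<first>\d+) x (?P<second>\d+)            # leading pair, e.g. 21x04
    \s* - \s*
    (?P<title>[^()]+?)                          # text before the first parenthesis
    \s* \( (?P<code>\d+)                        # number inside the parentheses
        (?: \s* - \s* (?P<inner>[^()]+?) )?     # optional text after the number
    \s* \)
    (?: \s* \( (?P<extra>[^()]+) \) )?          # optional second parenthesised text
    (?: \s* - \s* (?P<title2>[^()]+?)           # optional trailing text ...
        \s* \( (?P<code2>\d+) \) )?             # ... followed by its own number
""", re.VERBOSE)


def parse(name):
    """Return a dict of the named groups, or None if the name does not match."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


for sample in ('21x04 - Some Text (04)',
               '6x03 - Some Text (00 - Some Text) (Some Text)',
               '23x01-Some Text(10)-Some Text(02)'):
    print(parse(sample))
```

Groups that did not take part in a match come back as `None`, which makes it easy to tell the four forms apart.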

Federico Piazza · Answer 1 · 2015-04-08T23:00:03.557


You can discard some of the optional non-capturing groups. For instance, you can change this:

(\d+)x(\d+)(?:\s+)?-(?:\s+)?([^\(\)]+)(?:\s+)?\((\d+)(?:(?:\s+)?-(?:\s+)?([^\(\)]+))?\)(?:(?:\s+)?\(([^\(\)]+)\))?(?:(?:\s+)?-(?:\s+)?([^\(\)]+) \((\d+)\))?

To this:

(\d+)x(\d+)\W+([^()]+)\D+\((\d+)(?:\W*-\W*([^()]+))?\)(?:\W*\(([^()]+)\))?(?:\W*-\W*([^()]+) \((\d+)\))?

Working demo

I replaced some of the (?:\s+)? groups with \W*. Also, you don't have to escape parentheses inside a character class: [^\(\)] can be written as [^()].

By the way, you could also try this regex; it might be useful for you:

(\d+)x(\d+)|-\s*([\w\s]+)|(\w+)

Working demo
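
A minimal sketch of how this alternation might be applied, assuming it is meant to be scanned across the whole name with `re.finditer` (the answer does not say how it should be used). The `tokens` helper is hypothetical. Each match fills only the groups of one alternative, so the empty ones are skipped.

```python
import re

TOKEN = re.compile(r'(\d+)x(\d+)|-\s*([\w\s]+)|(\w+)')


def tokens(name):
    """Collect every non-empty captured group, left to right, stripped."""
    found = []
    for match in TOKEN.finditer(name):
        found.extend(g.strip() for g in match.groups() if g)
    return found


print(tokens('21x04 - Some Text (04)'))
# ['21', '04', 'Some Text', '04']
print(tokens('6x03 - Some Text (00 - Some Text) (Some Text)'))
# ['6', '03', 'Some Text', '00', 'Some Text', 'Some', 'Text']
```

Note that text which is not preceded by a hyphen falls through to the `(\w+)` alternative and is split into single words, as the second example shows.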

edited Apr 08 '15 at 23:00

answered Apr 08 '15 at 22:54

Federico Piazza


Why all the switches to `\W` instead of `\s`? I think some of those might be incorrect. – user2357112 Apr 08 '15 at 23:06
@user2357112 They generated the same output as the original OP regex, you can test that in the first working demo link. It works for what OP used to match – Federico Piazza Apr 08 '15 at 23:08
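
The question raised in these comments can be checked directly. The sketch below compares the two patterns on the third capture group; the two test strings are made up for illustration and are not from the original post. `\W` matches any non-word character (punctuation as well as whitespace), and `\D+` requires at least one non-digit character before the opening parenthesis, so the two patterns can disagree on inputs outside the four sample forms.

```python
import re

ORIGINAL = re.compile(
    r'(\d+)x(\d+)(?:\s+)?-(?:\s+)?([^\(\)]+)(?:\s+)?\((\d+)'
    r'(?:(?:\s+)?-(?:\s+)?([^\(\)]+))?\)(?:(?:\s+)?\(([^\(\)]+)\))?'
    r'(?:(?:\s+)?-(?:\s+)?([^\(\)]+) \((\d+)\))?')
REWRITE = re.compile(
    r'(\d+)x(\d+)\W+([^()]+)\D+\((\d+)(?:\W*-\W*([^()]+))?\)'
    r'(?:\W*\(([^()]+)\))?(?:\W*-\W*([^()]+) \((\d+)\))?')


def third_group(pattern, name):
    """Return capture group 3 (the first text field), or None if no match."""
    match = pattern.match(name)
    return match.group(3) if match else None


stray = '21x04 !- Some Text (04)'   # hypothetical input with stray punctuation
tight = '21x04-Some Text(04)'       # hypothetical input with no spaces at all

print(third_group(ORIGINAL, stray))    # None: \s does not match '!'
print(third_group(REWRITE, stray))     # 'Some Text': \W+ accepts ' !- '
print(third_group(ORIGINAL, tight))    # 'Some Text'
print(third_group(REWRITE, tight))     # 'Some Tex': \D+ takes the final 't'
```

On the four sample forms with single spaces both patterns match, so whether the difference matters depends on how irregular the real file names are.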

chapelo · Answer 2 · 2015-04-09T01:31:12.773

To simplify the problem, consider breaking it into two parts: (1) extract the strings (which may contain numbers or letters), and (2) extract the numbers from any string that contains them:

data = '''21x04 - Some Text (04)
6x03 - Some Text (00 - Some Text)
6x03 - Some Text (00 - Some Text) (Some Text)
23x01 - Some Text (10) - Some Text (02)'''

import re

# the regex to extract your data as strings (runs of word characters and whitespace)
aaa = re.compile(r'[\w\s]+')

# the regex to extract the numbers from the strings
nnn = re.compile(r'\d+')

for line in data.split('\n'):
    matches = aaa.findall(line)
    groups = []
    for m in matches:
        m = m.strip()
        n = nnn.findall(m)
        if m != '':
            groups.extend([m] if n == [] else n)
    print(groups)

    # ['21', '04', 'Some Text', '04']
    # ['6', '03', 'Some Text', '00', 'Some Text']
    # ['6', '03', 'Some Text', '00', 'Some Text', 'Some Text']
    # ['23', '01', 'Some Text', '10', 'Some Text', '02']
