
I have the following regexp (using Python syntax):

(\d+)x(\d+)(?:\s+)?-(?:\s+)?([^\(\)]+)(?:\s+)?\((\d+)(?:(?:\s+)?-(?:\s+)?([^\(\)]+))?\)(?:(?:\s+)?\(([^\(\)]+)\))?(?:(?:\s+)?-(?:\s+)?([^\(\)]+) \((\d+)\))?

It matches strings which fit one of the following forms:

21x04 - Some Text (04)
6x03 - Some Text (00 - Some Text)
6x03 - Some Text (00 - Some Text) (Some Text)
23x01 - Some Text (10) - Some Text (02)

The numbers and text vary, and are captured. However, the spacing is not always consistent, so it is designed to allow for any number of spaces.

Is there a way of simplifying it - I'm not necessarily asking for someone to do this for me, just to tell me if there are tools (a Google search yielded a few results, but none of them could handle it), or a systematic method of doing this.

Or can anyone see a better regex that fits this scenario?

p0llard
  • What is the regex supposed to do? Are the capture groups important, or did you simply add brackets? – Willem Van Onsem Apr 08 '15 at 22:35
  • The capture groups are important. Where they are not, I've used (?: ). I'm adding a description. – p0llard Apr 08 '15 at 22:36
  • Requests for tools are off-topic. *"Is there a way of simplifying it"* feels a little too broad. What is this regex actually **for**? – jonrsharpe Apr 08 '15 at 22:38
  • For one simplification, note that you can replace e.g. `(?:\s+)?` (one or more whitespace characters, optionally) with `\s*` (zero or more whitespace characters) – jonrsharpe Apr 08 '15 at 22:42
  • @jonrsharpe I have a lot of filenames of videos/music (from a backup) which are in one of the formats illustrated in the posts, with inconsistent spacing. I'm trying to extract data from the file names so that I can categorise them again. I can't see any other way of doing so (ie. not using regex) as they don't have any metadata associated with them. – p0llard Apr 08 '15 at 22:44
  • I would say the most important first step would be to put the regex into [verbose](https://docs.python.org/2/library/re.html#re.X) form, with comments, so it's actually readable. – user2357112 Apr 08 '15 at 22:47
  • You could also use named capture groups to clarify what's going on (e.g. `(?P<first>\d+)x(?P<second>\d+)`). What metadata are you trying to extract - what could be in each group? – jonrsharpe Apr 08 '15 at 22:48
  • @jonrsharpe I'll edit the post to make it more readable. – p0llard Apr 08 '15 at 22:51
  • Thanks. What is your goal with the simplification - make it more readable, shorter, faster, ...? – jonrsharpe Apr 08 '15 at 22:56
  • @jonrsharpe More readable is the main goal. Whilst it works, I don't really like it being so messy. – p0llard Apr 08 '15 at 22:58
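
The suggestions in the comments above (`\s*` instead of `(?:\s+)?`, verbose mode with comments, named groups) can be combined roughly as follows. This is only a sketch: the group names (`first`, `second`, `title`, `number`, `inner`, `extra`, `title2`, `number2`) and the `parse` helper are hypothetical, since the thread never says what each field means, and the pattern is slightly stricter and tidier than the original (it requires a full match and uses lazy quantifiers so captured text has no trailing spaces).

```python
import re

# Hypothetical group names; rename them to whatever the fields really mean.
FILENAME = re.compile(r"""
    (?P<first>\d+) x (?P<second>\d+)          # e.g. 21x04
    \s* - \s*
    (?P<title>[^()]+?)                        # text before the first parenthesis
    \s* \( (?P<number>\d+)                    # opening parenthesis and number
        (?: \s* - \s* (?P<inner>[^()]+?) )?   # optional text after the number
    \s* \)
    (?: \s* \( (?P<extra>[^()]+) \) )?        # optional second parenthesised text
    (?: \s* - \s* (?P<title2>[^()]+?)         # optional trailing title ...
        \s* \( (?P<number2>\d+) \) )?         # ... with its own number
""", re.VERBOSE)


def parse(filename):
    """Return a dict of the named groups, or None if the name does not fit."""
    match = FILENAME.fullmatch(filename.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    for name in ["21x04 - Some Text (04)",
                 "6x03 - Some Text (00 - Some Text) (Some Text)",
                 "23x01-Some Text(10)  -  Some Text (02)"]:
        print(parse(name))
```

In verbose mode, whitespace inside the pattern is ignored, so the literal space in the original has to be written as `\s*` (or `[ ]`) here.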

2 Answers


You can drop some of the optional non-capturing groups. For instance, you can change this:

(\d+)x(\d+)(?:\s+)?-(?:\s+)?([^\(\)]+)(?:\s+)?\((\d+)(?:(?:\s+)?-(?:\s+)?([^\(\)]+))?\)(?:(?:\s+)?\(([^\(\)]+)\))?(?:(?:\s+)?-(?:\s+)?([^\(\)]+) \((\d+)\))?

To this:

(\d+)x(\d+)\W+([^()]+)\D+\((\d+)(?:\W*-\W*([^()]+))?\)(?:\W*\(([^()]+)\))?(?:\W*-\W*([^()]+) \((\d+)\))?

Working demo

I replaced some of the (?:\s+)? groups with \W*. Also, you don't have to escape parentheses inside a character class: instead of [^\(\)] you can write [^()].
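
One way to check whether the two patterns behave the same is to run both over the sample strings from the question and compare the captured groups. The sketch below assumes that comparison is done after stripping surrounding whitespace; `stripped_groups` and the constant names are hypothetical helpers, not part of either pattern.

```python
import re

ORIGINAL = re.compile(
    r"(\d+)x(\d+)(?:\s+)?-(?:\s+)?([^\(\)]+)(?:\s+)?\((\d+)"
    r"(?:(?:\s+)?-(?:\s+)?([^\(\)]+))?\)(?:(?:\s+)?\(([^\(\)]+)\))?"
    r"(?:(?:\s+)?-(?:\s+)?([^\(\)]+) \((\d+)\))?"
)
SHORTER = re.compile(
    r"(\d+)x(\d+)\W+([^()]+)\D+\((\d+)(?:\W*-\W*([^()]+))?\)"
    r"(?:\W*\(([^()]+)\))?(?:\W*-\W*([^()]+) \((\d+)\))?"
)

SAMPLES = [
    "21x04 - Some Text (04)",
    "6x03 - Some Text (00 - Some Text)",
    "6x03 - Some Text (00 - Some Text) (Some Text)",
    "23x01 - Some Text (10) - Some Text (02)",
]


def stripped_groups(pattern, text):
    """Return the captured groups with surrounding whitespace removed."""
    match = pattern.match(text)
    if match is None:
        return None
    return tuple(g.strip() if g is not None else None for g in match.groups())


if __name__ == "__main__":
    for sample in SAMPLES:
        same = stripped_groups(ORIGINAL, sample) == stripped_groups(SHORTER, sample)
        print(sample, "->", "same" if same else "different")
```

On these four samples the captures agree once whitespace is stripped; without stripping, the third group of the original keeps a trailing space. The patterns are not equivalent in general, though: `\W` accepts any non-word character, so the shorter pattern also matches a string such as `21x04 ! Some Text (04)`, which the original rejects.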

Btw, you can test this regex too, it might be useful for you:

(\d+)x(\d+)|-\s*([\w\s]+)|(\w+)

Working demo

Federico Piazza
  • Why all the switches to `\W` instead of `\s`? I think some of those might be incorrect. – user2357112 Apr 08 '15 at 23:06
  • @user2357112 They generated the same output as the original OP regex, you can test that in the first working demo link. It works for what OP used to match – Federico Piazza Apr 08 '15 at 23:08

To simplify the problem, consider breaking it into two parts: 1. extract the strings (which may contain numbers or letters) and 2. extract the numbers from those strings that contain any:

data = '''21x04 - Some Text (04)
6x03 - Some Text (00 - Some Text)
6x03 - Some Text (00 - Some Text) (Some Text)
23x01 - Some Text (10) - Some Text (02)'''

import re

# the regex to extract your data as strings
# (raw strings keep \w, \s and \d from being treated as string escapes)
aaa = re.compile(r'[\w\s]+')

# the regex to extract the numbers from the strings
nnn = re.compile(r'\d+')

for line in data.split('\n'):
    matches = aaa.findall(line)
    groups = []
    for m in matches:
        m = m.strip()
        n = nnn.findall(m)
        if m != '':
            groups.extend([m] if n == [] else n)
    print(groups)

    # ['21', '04', 'Some Text', '04']
    # ['6', '03', 'Some Text', '00', 'Some Text']
    # ['6', '03', 'Some Text', '00', 'Some Text', 'Some Text']
    # ['23', '01', 'Some Text', '10', 'Some Text', '02']
chapelo