3

I am confused about the pattern matching in the following expression. I tried looking it up online but couldn't find an explanation I could follow:

imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')

What exactly are the parentheses doing? I understand everything up to the first asterisk, but I can't figure out what happens after that.

jaco0646
Vatsal Mishra
  • Are you trying to figure out what the actual URL is based on the regex in your OP? If so, it's not really possible. – Joe T. Boka Jun 28 '15 at 06:05

3 Answers

2

Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.

(http://i.imgur.com/(.*))(\?.*)?

[Figure: regular expression visualization (Debuggex Demo)]

So this starts with an imgur URL, http://i.imgur.com/(.*) (mandatory), having any characters until an optional '?' is encountered, followed by any characters after the '?'. Notice that the '?' has been escaped to remove its special meaning. In practice .* is greedy, so it also consumes the '?' and everything after it, which leaves the optional group empty. The pink highlights indicate the capture groups.
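
The capture groups can also be inspected directly in Python. The following is a minimal sketch using the pattern from the question; the sample URLs and the helper name show_groups are made up for illustration.

```python
import re

# Pattern copied from the question.
imgur_url_pattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')

def show_groups(url):
    """Return the three captured groups for url, or None if it does not match."""
    m = imgur_url_pattern.match(url)
    return m.groups() if m else None

# Hypothetical URLs, used only for illustration.
with_query = show_groups('http://i.imgur.com/abc123.jpg?1')
without_query = show_groups('http://i.imgur.com/abc123.jpg')

print(with_query)     # ('http://i.imgur.com/abc123.jpg?1', 'abc123.jpg?1', None)
print(without_query)  # ('http://i.imgur.com/abc123.jpg', 'abc123.jpg', None)
```

Groups are numbered by the position of their opening parenthesis, so the outer group is 1, the nested (.*) is 2, and the optional query group is 3.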

Identity1
1

The (.*) means any character repeated any number of times. The (\?.*)? is meant to match the query string of a URL, for example (an imgur search for "cat"):

http://i.imgur.com/search?q=cat

http://i.imgur.com/search is the part meant for (http://i.imgur.com/(.*)) (the search is specifically meant for the (.*) section of the regex), and ?q=cat is meant for the (\?.*)? section. Be aware that .* is greedy, so as written the (.*) also swallows ?q=cat and the last group stays empty. The ? at the end of the regex means optional: there might or might not be a query string. For instance, there is no query string in the URL http://www.imgur.com. The parentheses are used for grouping. We group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it, (.*), that matches the page you are requesting. We group (\?.*)? because it matches the query string.
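
To actually get the page and the query string into separate groups, one option is a lazy quantifier plus an end anchor. This is only a sketch of an assumed variant, not the pattern from the question: the dots are escaped, (.*) becomes (.*?), and $ is added at the end. The example uses the i.imgur.com host, which the pattern requires.

```python
import re

# Assumed variant, not the original pattern: dots are escaped, the page group is
# lazy (.*?), and $ anchors the end so the lazy group still covers the whole page.
split_pattern = re.compile(r'(http://i\.imgur\.com/(.*?))(\?.*)?$')

m = split_pattern.match('http://i.imgur.com/search?q=cat')
page = m.group(2)
query = m.group(3)
print(page)   # search
print(query)  # ?q=cat

# Without a query string the optional group does not participate in the match.
no_query = split_pattern.match('http://i.imgur.com/search')
print(no_query.group(3))  # None
```

Without the $ anchor the lazy group would match as little as possible, which here is the empty string, so both changes are needed together.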


abden003
1
(http://i.imgur.com/(.*))(\?.*)?

The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.

The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.

The last ? means that the last capturing group is optional.

EDIT: These groups can then be used as:

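import re

p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
# Hypothetical URL; a string such as 'ab' does not match, and p.match returns None.
m = p.match('http://i.imgur.com/abc123.jpg')
if m:
    print(m.group(0))  # the whole match
    print(m.group(2))  # the second group: the text after the host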

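To improve the regex, you should limit the engine to the characters you actually need, like:

(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]*)?

\. matches a literal dot instead of any character
[A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens (A-z would also let in punctuation such as [ and _)
[^/] excludes /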
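
A quick way to see the effect of the tighter character classes is to run them against a few strings. The sketch below assumes a cleaned-up form of the pattern (A-Za-z for letters, escaped dots, [^/]* for the query); the URLs and the helper name split_url are invented for the example.

```python
import re

# Assumed cleanup of the tightened pattern: escaped dots, A-Za-z for letters,
# and [^/]* for everything after the '?'.
strict_pattern = re.compile(r'(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]*)?')

def split_url(url):
    """Return (id, query) for a matching url, or None if it does not match."""
    m = strict_pattern.match(url)
    return (m.group(2), m.group(3)) if m else None

print(split_url('http://i.imgur.com/abc123?1'))  # ('abc123', '?1')
print(split_url('http://i.imgur.com/abc123'))    # ('abc123', None)
print(split_url('http://example.com/abc123'))    # None
```

Because a dot is outside the character class, a URL ending in abc123.jpg would still match, but group 2 would stop at abc123; add \. to the class if file extensions should be captured.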

xyz