
I've implemented a non-greedy regex on a group of URL strings, trying to clean them up so that they end after the `.com` (`.co.uk`, etc.). Some of them continued with `'` or `"` or `<` after the desired cutoff, so I used `x = re.findall('([A-Za-z0-9]+@\S+.co\S*?)[\'"<]', finalSoup2)`.

The problem is that some URLs look like `misc@misc.misc'misc''misc'` (or similar with `<` and `>`), so even after making the regex non-greedy I'm still left with `enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au`, for example.

I've tried two `?`'s together, but that obviously isn't working, so what's the proper way to achieve clean URLs in this situation?

DanielSon
  • or maybe just make the `\S+` non-greedy: `[A-Za-z0-9]+@\S+?.co\S*?)[\'"<]` – baddger964 Jul 01 '16 at 14:46
  • Question is not clear to me. Can you please give some details? Examples will do. – Jithin Pavithran Jul 01 '16 at 14:57
  •
    On a side note, parsing HTML with regex is **not the best idea**, especially since you're already using BeautifulSoup. If you need, say, to get all `a`s that have `href`s and extract those `href`s, BS allows you to do exactly that in a couple of lines, without any regex tomfoolery. – Daerdemandt Jul 01 '16 at 15:09
  • I'm in the process right now of learning the correct usages of different techniques. Your comment is actually very helpful – DanielSon Jul 01 '16 at 15:11
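Following the comment above, here is a minimal sketch of the BeautifulSoup approach; the HTML snippet is a hypothetical stand-in for the page being scraped:

```python
from bs4 import BeautifulSoup

# A small hypothetical fragment of scraped HTML
html = '<a href="mailto:enquiries@smart-traffic.com.au">contact</a><a>no href</a>'

soup = BeautifulSoup(html, "html.parser")
# find_all with href=True skips anchors that have no href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)  # ['mailto:enquiries@smart-traffic.com.au']
```

Because the attribute value is already parsed out of the markup, no trailing `"` or `<` ever appears in the result, which sidesteps the cleanup problem entirely.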

1 Answer


The issue with your regex is that it only looks for non-spaces followed by `.co`, instead of looking for non-spaces, a period, and more non-spaces.

So in this case, based on the information above, you could get away with the following regex:

>>> finalSoup2 = """
... misc@misc.misc'misc''misc
... enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au
... google.com
... google.co.uk"'<>Stuff
... """
>>> import re
>>> x = re.findall(r'([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)
>>> x
['misc@misc.misc',
 'enquiries@smart-traffic.com.au',
 'enquiries@smart-traffic.com.au\ngoogle.com\ngoogle.co.uk']

You can then use this to get the URLs you'd like, but make sure to split the results on `'\n'`, since a match may contain newline characters, as seen above.
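Putting that together, a sketch of the splitting step (keeping only the pieces that still look like addresses, i.e. contain an `@`):

```python
import re

finalSoup2 = """
misc@misc.misc'misc''misc
enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au
google.com
google.co.uk"'<>Stuff
"""

matches = re.findall(r'([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)
# A match may span several lines, so split each one on '\n'
# and keep only the fragments that contain an '@'
cleaned = [part for m in matches for part in m.split('\n') if '@' in part]
print(cleaned)
# ['misc@misc.misc', 'enquiries@smart-traffic.com.au', 'enquiries@smart-traffic.com.au']
```

If you want each address only once, wrapping `cleaned` in `dict.fromkeys(...)` preserves order while deduplicating.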

Cory Shay