6

I'm using an API that sometimes truncates links inside the text that it returns and instead of "longtexthere https://fancy.link" I get "longtexthere https://fa…".

I'm trying to match a link only if it's complete, in other words, only if it does not contain the "…" character.

So far I am able to get links by using the following regex:

((?:https?:)?\/\/\S+\/?)

but obviously it returns every link including broken ones.

I've tried to do something like this:

((?:https?:)?\/\/(?:(?!…)\S)+\/?)

That did stop the match at the "…" character, but the link was still returned, just without that character: for "https://fa…" it returned "https://fa", whereas I simply want the regex to ignore the broken link and move on.
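
For reference, the behaviour described above can be reproduced with a minimal sketch. Python's `re` module is an assumption made purely for illustration, and the sample text is made up:

```python
import re

# Second attempt from the question: a tempered token that stops before "…".
TEMPERED = re.compile(r"((?:https?:)?//(?:(?!…)\S)+/?)")

# Hypothetical sample text with one truncated and one complete link.
sample = "longtexthere https://fa… and https://fancy.link"

matches = TEMPERED.findall(sample)
print(matches)  # the truncated link still shows up, cut off before "…"
```

The pattern only refuses to consume "…"; nothing in it rejects a link that is immediately followed by that character.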

Been fighting this for hours and just can't get my head around it. :(

Thanks for any help in advance.

kiradotee
  • Does your regex engine allow possessive quantifiers? Try [`(?:https?:)?\/\/[^\s…]++(?!…)\/?`](https://regex101.com/r/jQ9lQ2/1) – Wiktor Stribiżew Apr 01 '16 at 14:13
  • Note you can also remove the `\/?` at the end, as it will never be matched. If your regex flavor is JavaScript or Python, try [`(?!\S+…)(?:https?:)?\/\/\S+`](https://regex101.com/r/jQ9lQ2/2) – Wiktor Stribiżew Apr 01 '16 at 14:21
  • If possessive quantifiers and lookbehind are supported by your regex flavor, you can also try [`(?:https?:)?\/\/\S++(?<!…)`](https://regex101.com/r/jU9jU8/1). The possessive quantifier prevents backtracking if the lookbehind does not match. – bobble bubble Apr 01 '16 at 16:16
  • Wow @WiktorStribiżew that worked!!! You should have posted it as an answer as that's the only correct answer. https://regex101.com/r/wC7tO5/1 – kiradotee Apr 04 '16 at 09:19
  • Oh, actually @bobblebubble yours is working too! https://regex101.com/r/zN7jS3/1 – kiradotee Apr 04 '16 at 09:20
  • Thanks guys, you're amazing! :) – kiradotee Apr 04 '16 at 09:20
  • But what is the regex flavor? Which pattern works for you? – Wiktor Stribiżew Apr 04 '16 at 09:26
  • @user45173 My solution is similar to Wiktor's first one, which I vote for. Also bear in mind that it is often essential to specify the regex flavor/tool you're working with. Otherwise it's just guesswork for those who want to answer. – bobble bubble Apr 04 '16 at 10:55
  • I'm using PHP 5.4, not sure which flavor of regex it uses? – kiradotee Apr 05 '16 at 09:11

4 Answers

4

You can use

(?:https?:)?\/\/[^\s…]++(?!…)\/?

See the regex demo. The possessive quantifier in [^\s…]++ matches all characters that are neither whitespace nor … without later backtracking, and the lookahead (?!…) then checks that the next character is not an ellipsis (…). If it is, no match is found.

As an alternative, if your regex engine does not support possessive quantifiers, use a negative lookahead version:

(?!\S+…)(?:https?:)?\/\/\S+\/?

See another regex demo. The lookahead (?!\S+…) fails the match if one or more non-whitespace characters are followed by the … character.
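
As a minimal sketch of the lookahead version in use: the snippet below assumes Python's `re` module purely for illustration (the question does not name a language), drops the optional trailing `\/?` because `\S+` already consumes any slash, and uses a hypothetical helper name and a made-up sample string.

```python
import re

# Lookahead variant: refuse to start a match anywhere inside a
# run of non-whitespace characters that is followed by "…".
COMPLETE_LINK = re.compile(r"(?!\S+…)(?:https?:)?//\S+")

def complete_links(text):
    """Return the links in text that are not cut off with '…'."""
    return COMPLETE_LINK.findall(text)

# Hypothetical sample text: one truncated link, two complete ones.
sample = "first https://fa… then https://fancy.link and //cdn.example/x"
print(complete_links(sample))
```

Because the lookahead is evaluated at every candidate start position, the engine cannot fall back to a shorter match such as "https://fa" or "//fa".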

Wiktor Stribiżew
  • Does exactly what I need! Thanks a lot. I'll also mention @bobblebubble's suggestion from above, `(?:https?:)?\/\/\S++(?<!…)`, as it seems to be similar and works too! – kiradotee Apr 05 '16 at 09:22
  • Yes, it is very similar, as it also uses a possessive quantifier to prevent backtracking into the matched characters. `\S++` matches all non-whitespace characters up to a whitespace or the end of the string, and the lookbehind then checks that the previous character is not an ellipsis. If it is, the match fails. – Wiktor Stribiżew Apr 05 '16 at 09:25
1

Try:

 ((?:https?:)?\/\/\S+[^ \.]{3}\/?)

It's the same as your original pattern; you just tell it that the last three characters should not be '.' (period) or ' ' (space).

UPDATE: Your second link worked.

and if you tweak your regex just slightly it will do what you want:

 ((?:https?:)?\/\/\S+[^ …] \/?)

Yes, it looks just like what you had, except I added a ' ' (space) after the part we do not want. This forces the regular expression to match up to and including the space, which it cannot do for a URL that ends with the '...' character. Without the space at the end it would match up to, but not including, the '...', which is why it was not doing what we wanted ;)
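
A rough sketch of how the updated pattern behaves, assuming Python's `re` module for illustration only; the helper name and sample strings are hypothetical:

```python
import re

# Pattern from the update above: a literal space is required after the link.
WITH_SPACE = re.compile(r"((?:https?:)?//\S+[^ …] /?)")

def links_followed_by_space(text):
    # The capture includes the trailing space, so strip it off.
    return [m.strip() for m in WITH_SPACE.findall(text)]

print(links_followed_by_space("a https://fa… b https://fancy.link c"))
print(links_followed_by_space("a https://fancy.link"))
```

In this sketch a link at the very end of the input is not matched, since no space follows it; appending a space to the text before matching is one way around that.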

Rob
  • I've modified yours slightly (because it's a special character rather than three dots), although it didn't do the trick https://regex101.com/r/zJ7lM0/1 – kiradotee Apr 01 '16 at 14:41
  • for some reason the url you have is blocked for me. :( – Rob Apr 01 '16 at 15:20
  • Huh, you're the first person who couldn't open regex101.com . Maybe this link will work? http://regexr.com/3d53k – kiradotee Apr 04 '16 at 09:12
  • @user45173 Sorry I did not realize the '...' was a single Unicode character. I was able to make it work by adding a space in the pattern you had on the regexr.com side. See my update. – Rob Apr 04 '16 at 15:11
1

You can try the following regex:

https?:\/\/\w+(?:\.\w+\/?)+(?!\.{3})(\s|$)

See demo https://regex101.com/r/bS6tT5/3

Saleem
  • Yes, it was skipping URLs ending with `/`. Try again. It should match 4; the rest are either not valid URLs or don't match because of the URLs you have set. – Saleem Apr 04 '16 at 10:35
0

Please try:

https?:\/\/[^ ]*?…|(https?:\/\/[^ ]+\.[^ ]+)

Here is the demo.

Quinn