
I have this exercise:

Given these links:

1. http://example.com/cat1/subcat3/subcat4/tag/this%20is%20page/asdasda?start=130
2. http://example.com/cat1/subcat3/subcat4/tag/this%20is%20pageasdasd
3. example.it/news/tag/this%is%20n%page?adsadsadasd
4. http://example.com/tag/thispage/asdasdasd.-?asds=
5. http://example.com/tag/this%20is%20page/asdasd
6. /tag/this/asdasdasd
7. /tag/asd-asd/feed/this-feed
8. /tag/sd-asd
  • In the first case, the result must be: http://example.com/tag/this%20is%20page
  • In the second case, the result must be: http://example.com/tag/this%20is%20pageasdasd
  • In the third case, the result must be: example.it/tag/this%is%20n%page
  • In the fourth case, the result must be: http://example.com/tag/thispage
  • In the fifth case, the result must be: http://example.com/tag/this%20is%20page
  • In the sixth case, the result must be: /tag/this
  • In the seventh case, the result must be: /tag/asd-asd

But the eighth case must not be matched by the regex. The same goes for the domain name.

I tried to build it here: https://regex101.com/r/aB5mPn/5 but I can't get the regex to skip the last case.

Can anyone help me?

Kouga
  • Your regex looks to be working well. What exactly is the problem with the last case? – PJProudhon Feb 02 '18 at 13:13
  • Hello! Just not consider that case – Kouga Feb 02 '18 at 13:14
  • Doesn't "*not consider*" mean "*don't change stuff on it*"? Or is it that you should completely erase it? If the former, then your regex seems fine; otherwise, I don't see the logic behind it. – Rafalon Feb 02 '18 at 13:16
  • @Rafalon As you can see the domain name, does not considered from regex (not match). The same should be for last case. – Kouga Feb 02 '18 at 13:23
  • "*As you can see the domain name, does not considered from regex*" - what ? Was this google translated ? – Rafalon Feb 02 '18 at 13:25
  • "_As you can see, the regex doesn't catch domain name. The same should be for last case._" Now is better? – Kouga Feb 02 '18 at 13:29
  • I'm sorry, I don't see the difference between 7 and 8, and even though you catch the 8th case, you replace it with itself, so the result is the same as for the domain name, isn't it? – Rafalon Feb 02 '18 at 13:32
  • @Rafalon The seventh has more segments after the one that follows "tag", whereas the last case has nothing after it. Anyway, the regex never catches the domain name. – Kouga Feb 02 '18 at 13:37

1 Answer


If I am not mistaken, you could add a negative lookahead before matching /tag...etc to assert that, for the eighth case, what follows is not /tag/sd-asd up to the end of the string: (?!\/tag\/[^\/]+$)

Your regex could look like:

(?:(?:\/[A-Za-z0-9-]+)?)+(?!\/tag\/[^\/]+$)(\/tag\/[A-Za-z0-9-%]+)(.*)
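
To make the idea easier to try out, here is a minimal sketch in Python. The question does not name a regex engine, so the choice of Python's `re` module is an assumption, and the names `TAG_PATTERN`, `LINKS` and `strip_around_tag` are hypothetical. The pattern is a simplified rewrite of the one above (unescaped slashes, a plain `*` instead of the nested optional group, `%-` reordered inside the character class), not a verbatim copy. It treats all links as one input string, because the lookahead relies on `$` meaning "end of the whole input" when the multiline flag is off.

```python
import re

# Simplified rewrite of the pattern above (an assumption, not a verbatim copy).
TAG_PATTERN = re.compile(
    r"(?:/[A-Za-z0-9-]+)*"    # optional path segments in front of /tag
    r"(?!/tag/[^/]+$)"        # refuse a bare /tag/<name> that ends the whole input
    r"(/tag/[A-Za-z0-9%-]+)"  # group 1: the tag segment to keep
    r"(.*)"                   # group 2: the rest of the line, which is dropped
)

# The eight links from the question, without their list numbers.
LINKS = [
    "http://example.com/cat1/subcat3/subcat4/tag/this%20is%20page/asdasda?start=130",
    "http://example.com/cat1/subcat3/subcat4/tag/this%20is%20pageasdasd",
    "example.it/news/tag/this%is%20n%page?adsadsadasd",
    "http://example.com/tag/thispage/asdasdasd.-?asds=",
    "http://example.com/tag/this%20is%20page/asdasd",
    "/tag/this/asdasdasd",
    "/tag/asd-asd/feed/this-feed",
    "/tag/sd-asd",
]


def strip_around_tag(text):
    """Replace every match with group 1, i.e. keep only the /tag/<name> part."""
    return TAG_PATTERN.sub(r"\1", text)


# All links are joined into one string so that `$` only matches after the last one.
cleaned = strip_around_tag("\n".join(LINKS)).split("\n")
for before, after in zip(LINKS, cleaned):
    print(f"{before}  ->  {after}")
```

The domain is never part of a match because `.` and `:` are not in the segment character class, and the last link is left alone only because it sits at the very end of the input.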

The fourth bird
  • Best! Thank you! – Kouga Feb 02 '18 at 13:40
  • Even if it works, I see no added value using that regex. It adds a lot more backtracking just to prevent one identity replacement? – PJProudhon Feb 02 '18 at 13:49
  • @The fourth bird If I add more strings, the regex doesn't work anymore: [link](https://regex101.com/r/aB5mPn/7). As you can see, the third-to-last line now matches and the last one does not. – Kouga Feb 02 '18 at 23:27
  • @Kouga That is because the lookahead checks whether it is at the end of the string, and once you add content, that position is no longer the end of the string. What you could do is use anchors for the beginning `^` and the end of the line `$`. For example: [`^(?!\/tag\/[^\/]+$)(?:(?:https?:\/\/)?[^\/]+)?(.*(?=\/tag))(\/tag\/.*?(?=\/?))([\/?].*$)?$`](https://regex101.com/r/bNh6v9/1/) The part starting with `/tag/untilthefirstforwardslash` will be in group 2. The part that excludes `/tag/sd-asd` at the start of the string is inside a negative lookahead at the beginning: `(?!\/tag\/[^\/]+$)` – The fourth bird Feb 03 '18 at 13:18
  • @The fourth bird Now you also match `http://example.com`. Furthermore, `http://example.com/tag/this%is` or `http://example.com/tag/this-is` must not match. In summary, I need to get `/tag/` only if this pattern is preceded AND / OR followed by other segments, without matching the domain name, as in your previous answer. – Kouga Feb 03 '18 at 13:28
  • @Thefourthbird [link](https://regex101.com/r/aB5mPn/8) Like this, but the last three test strings must not be considered! – Kouga Feb 03 '18 at 14:13
  • @Kouga Do you need to keep all the current selections and groups or do you need just 1 selection? What are you trying to accomplish? If your regex engine supports \K, you could use [`(?:\/[A-Za-z0-9-%]+\K)\/tag\/[A-Za-z0-9-%]+|\/tag\/[A-Za-z0-9-%]+(?=\/[A-Za-z0-9-%]+)`](https://regex101.com/r/BLtmkp/1). You could capture those in a [named captured group](https://regex101.com/r/K6rGCh/1). Or you could try it [like this](https://regex101.com/r/Loq3uI/1) – The fourth bird Feb 03 '18 at 16:11
  • @Thefourthbird The last one is almost perfect! If I use the `${tag}` group, in Match 2 and 3 I also get the previous segment, as you can see: [link](https://regex101.com/r/Loq3uI/1) – Kouga Feb 03 '18 at 17:21
  • @Thefourthbird I tried to remove the segment that precedes `/tag/segment` from the `${tag}` group, but I can't :') – Kouga Feb 04 '18 at 10:07
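
The end-of-string behaviour discussed in the comments can be reproduced with a small sketch. It assumes Python's `re` module (no engine is named in the thread) and uses hypothetical names (`SOURCE`, `WHOLE_INPUT`, `PER_LINE`, `matched_tags`) with a simplified rewrite of the answer's pattern. It only illustrates how `$` behaves with and without the multiline flag; it is not a solution to the follow-up requirements.

```python
import re

# Simplified rewrite of the answer's pattern (an assumption, not a verbatim copy).
SOURCE = r"(?:/[A-Za-z0-9-]+)*(?!/tag/[^/]+$)(/tag/[A-Za-z0-9%-]+)(.*)"

WHOLE_INPUT = re.compile(SOURCE)             # `$` matches only at the end of the input
PER_LINE = re.compile(SOURCE, re.MULTILINE)  # `$` also matches at the end of each line


def matched_tags(pattern, text):
    """Return group 1 (the /tag/<name> part) of every match in text."""
    return [m.group(1) for m in pattern.finditer(text)]


short_input = "/tag/this/asdasdasd\n/tag/sd-asd"
longer_input = short_input + "\n/tag/other-tag"  # hypothetical extra line

# Without the multiline flag, only the bare tag on the LAST line is skipped,
# so appending a line makes /tag/sd-asd match again.
print(matched_tags(WHOLE_INPUT, short_input))
print(matched_tags(WHOLE_INPUT, longer_input))

# With the multiline flag, every line that ends right after the tag is skipped.
print(matched_tags(PER_LINE, longer_input))

# Trade-off: the second link from the question also ends right after the tag,
# so the per-line variant skips it as well.
second_link = "http://example.com/cat1/subcat3/subcat4/tag/this%20is%20pageasdasd"
print(PER_LINE.search(second_link))
```

In other words, the flag alone does not separate "a bare `/tag/name`" from "segments followed by `/tag/name`", which seems to be why the later comments move towards alternation and extra lookarounds.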