Remove segments from a URL with a regex replacement

Question

I have this exercise.

Given these links:

1. http://example.com/cat1/subcat3/subcat4/tag/this%20is%20page/asdasda?start=130
2. http://example.com/cat1/subcat3/subcat4/tag/this%20is%20pageasdasd
3. example.it/news/tag/this%is%20n%page?adsadsadasd
4. http://example.com/tag/thispage/asdasdasd.-?asds=
5. http://example.com/tag/this%20is%20page/asdasd
6. /tag/this/asdasdasd
7. /tag/asd-asd/feed/this-feed
8. /tag/sd-asd

In the first case the result must be: http://example.com/tag/this%20is%20page
In the second case the result must be: http://example.com/tag/this%20is%20pageasdasd
In the third case the result must be: example.it/tag/this%is%20n%page
In the fourth case the result must be: http://example.com/tag/thispage
In the fifth case the result must be: http://example.com/tag/this%20is%20page
In the sixth case the result must be: /tag/this
In the seventh case the result must be: /tag/asd-asd

The eighth case, however, must not be matched by the regex at all. The same goes for the domain name.

My attempt is at https://regex101.com/r/aB5mPn/5, but I can't get it to skip the last case.

Can anyone help me?

Your regex looks to be working well. What exactly is the problem with the last case? — PJProudhon, Feb 02 '18 at 13:13
Doesn't "*not consider*" mean "*don't change stuff on it*" ? Or is it you should completely erase it ? If the first, then your regex seems fine, else, I don't see the logic behind — Rafalon, Feb 02 '18 at 13:16
@Rafalon As you can see the domain name, does not considered from regex (not match). The same should be for last case. — Kouga, Feb 02 '18 at 13:23
"*As you can see the domain name, does not considered from regex*" - what ? Was this google translated ? — Rafalon, Feb 02 '18 at 13:25
"_As you can see, the regex doesn't catch domain name. The same should be for last case._" Now is better? — Kouga, Feb 02 '18 at 13:29
I'm sorry, I don't see the difference between 7 and 8, and even though you catch the 8th case, you replace it with itself, so the result is the same as for the domain name, isn't it? — Rafalon, Feb 02 '18 at 13:32
@Rafalon The seventh has further segments after the one that follows "tag". The last case has nothing after it. In any case, the regex never matches the domain name. — Kouga, Feb 02 '18 at 13:37

score 2 · Accepted Answer · answered Feb 02 '18 at 13:33

If I am not mistaken, you could add a negative lookahead just before the part that matches /tag/... to assert that what follows is not, as in the eighth case, /tag/sd-asd running to the end of the string: (?!\/tag\/[^\/]+$)

Your regex could look like:

(?:(?:\/[A-Za-z0-9-]+)?)+(?!\/tag\/[^\/]+$)(\/tag\/[A-Za-z0-9-%]+)(.*)
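To try the pattern outside regex101, here is a minimal sketch in Python. It rests on a few assumptions that are not stated in the question: the replacement is the first capture group, the eight links are fed in as one multi-line text (as on regex101), and no flags are set, so `$` only matches at the very end of the input. The names `PATTERN`, `LINKS` and `strip_segments` are made up for the illustration.

```python
import re

# Pattern copied from the answer; only the Python raw-string wrapper is new.
PATTERN = re.compile(
    r"(?:(?:\/[A-Za-z0-9-]+)?)+"  # optional run of plain path segments before /tag
    r"(?!\/tag\/[^\/]+$)"         # reject a bare /tag/<segment> that ends the whole input
    r"(\/tag\/[A-Za-z0-9-%]+)"    # group 1: the part to keep
    r"(.*)"                       # group 2: the rest of the line, dropped by the replacement
)

# The eight links from the question, joined into one text (assumption: this
# mirrors how the test strings were entered on regex101).
LINKS = "\n".join([
    "http://example.com/cat1/subcat3/subcat4/tag/this%20is%20page/asdasda?start=130",
    "http://example.com/cat1/subcat3/subcat4/tag/this%20is%20pageasdasd",
    "example.it/news/tag/this%is%20n%page?adsadsadasd",
    "http://example.com/tag/thispage/asdasdasd.-?asds=",
    "http://example.com/tag/this%20is%20page/asdasd",
    "/tag/this/asdasdasd",
    "/tag/asd-asd/feed/this-feed",
    "/tag/sd-asd",
])


def strip_segments(text: str) -> str:
    """Replace every match with its first group, i.e. the /tag/<segment> part."""
    return PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    for before, after in zip(LINKS.splitlines(), strip_segments(LINKS).splitlines()):
        print(f"{before}  ->  {after}")
```

The domain name is left alone because the segment class `[A-Za-z0-9-]` has no dot, so a match can only start at the first path segment after the host.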

The fourth bird

Best! Thank you! – Kouga Feb 02 '18 at 13:40
Even if it works, I see no added value using that regex. It adds a lot more backtracking just to prevent one identity replacement? – PJProudhon Feb 02 '18 at 13:49
@The fourth bird If I add more strings, the regex doesn't work anymore: [link](https://regex101.com/r/aB5mPn/7). As you can see, the third-to-last string now matches and the last one does not. – Kouga Feb 02 '18 at 23:27
@Kouga That is because the lookahead checks if it is at the end of the string, and by adding content this will not be the end of the string anymore. What you could do is use anchors for the beginning `^` and the end `$` of the line. For example: [`^(?!\/tag\/[^\/]+$)(?:(?:https?:\/\/)?[^\/]+)?(.*(?=\/tag))(\/tag\/.*?(?=\/?))([\/?].*$)?$`](https://regex101.com/r/bNh6v9/1/) The part starting with `/tag/untilthefirstforwardslash` will be in group 2. The part that excludes `/tag/sd-asd` at the start of the string is inside a negative lookahead in the beginning: `(?!\/tag\/[^\/]+$)` – The fourth bird Feb 03 '18 at 13:18
@The fourth bird Now you also match `http://example.com`. Furthermore, `http://example.com/tag/this%is` or `http://example.com/tag/this-is` must not match. In summary, I need to match `/tag/` only if this pattern is preceded and/or followed by other segments, without matching the domain name, as in your previous answer. – Kouga Feb 03 '18 at 13:28
@Thefourthbird [link](https://regex101.com/r/aB5mPn/8) Like this, but the last three test strings must not be matched! – Kouga Feb 03 '18 at 14:13
@Kouga Do you need to keep all the current selections and groups or do you need just 1 selection? What are you trying to accomplish? If your regex engine supports \K, you could use [`(?:\/[A-Za-z0-9-%]+\K)\/tag\/[A-Za-z0-9-%]+|\/tag\/[A-Za-z0-9-%]+(?=\/[A-Za-z0-9-%]+)`](https://regex101.com/r/BLtmkp/1). You could capture those in a [named captured group](https://regex101.com/r/K6rGCh/1). Or you could try it [like this](https://regex101.com/r/Loq3uI/1) – The fourth bird Feb 03 '18 at 16:11
@Thefourthbird The last one is almost perfect! If I use the `${tag}` group, in Match 2 and 3 I also get the previous segment, as you can see: [link](https://regex101.com/r/Loq3uI/1) – Kouga Feb 03 '18 at 17:21
@Thefourthbird I tried to remove the segment preceding `/tag/segment` from the `${tag}` group, but I can't :') – Kouga Feb 04 '18 at 10:07
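The end-of-string behaviour discussed in the comments above can be reproduced with a short sketch. It assumes Python's `re` module with no flags, so `$` matches only at the end of the whole input. The helper `matched_texts` and the extra line `/tag/extra` are hypothetical and serve only to show the effect.

```python
import re
from typing import List

# Same pattern as in the accepted answer (assumption: no flags set).
PATTERN = re.compile(
    r"(?:(?:\/[A-Za-z0-9-]+)?)+(?!\/tag\/[^\/]+$)(\/tag\/[A-Za-z0-9-%]+)(.*)"
)


def matched_texts(text: str) -> List[str]:
    """Return the full text of every match, in order of appearance."""
    return [m.group(0) for m in PATTERN.finditer(text)]


tail = "/tag/asd-asd/feed/this-feed\n/tag/sd-asd"

# /tag/sd-asd ends the input, so the negative lookahead rejects it.
print(matched_texts(tail))

# With one more (hypothetical) line appended, /tag/sd-asd no longer ends the
# input: it now matches, and the new last line is the one that gets skipped.
print(matched_texts(tail + "\n/tag/extra"))

# Tested on its own, the second link from the question ends the input with
# /tag/<segment>, so the same lookahead blocks it as well.
print(matched_texts("http://example.com/cat1/subcat3/subcat4/tag/this%20is%20pageasdasd"))
```

Because the exclusion depends on where the input ends, it behaves differently when each link is tested separately than when all links sit in one text.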
