2

I did not find a good answer to this question, so I am sharing what I found that works.

If you want to remove all the Google Analytics (utm) parameters from a URL, you usually want to keep the other parameters and end up with a clean, valid URL.

url = url.replace(/(\&|\?)utm([_a-z0-9=+\-]+)/igm, "$1");

With a URL like this: https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?utm_source=325483&utm_medium=affiliation&utm_content=catalogue-RDC&awc=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249

you will get this: https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?&&&awc=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249

This URL is already valid, but it contains duplicate & signs. If you remove the $1 from the first replacement, you end up with only a & sign and not the ? that should start the query string.

So in the next clean-up step we keep the ? sign (=> $1) and remove the & signs that directly follow it.

url = url.replace(/(\?)\&+/igm, "$1");

Now we have a nice clean URL.

Full version:

url = url.replace(/(\&|\?)utm([_a-z0-9=+\-]+)/igm, "$1");
url = url.replace(/(\?)\&+/igm, "$1");

If you can find a one-liner, you're welcome to share it.

Edit: the resulting URL should be this one: https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?awc=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249
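
The two replacements can be wrapped in a small helper. This is only a minimal sketch: the function name `stripUtm` and the example.com URLs are assumptions made for illustration.

```javascript
// Hypothetical helper wrapping the two replacements described above.
function stripUtm(url) {
  // Step 1: drop each utm parameter but keep its leading "?" or "&".
  const withoutUtm = url.replace(/(&|\?)utm([_a-z0-9=+\-]+)/gi, "$1");
  // Step 2: collapse the "&" signs left directly after the "?".
  return withoutUtm.replace(/(\?)&+/g, "$1");
}

console.log(stripUtm("https://example.com/p?utm_source=1&utm_medium=2&awc=3"));
// https://example.com/p?awc=3
console.log(stripUtm("https://example.com/p?awc=3&utm_tt=78"));
// https://example.com/p?awc=3&   (a trailing "&" is left behind)
```

Note that the second step only cleans up the & signs right after the ?, so a utm parameter in the middle or at the end of the query string still leaves an extra & behind.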

benraay

2 Answers

6

I think it could be as simple as: url = url.replace(/(?<=&|\?)utm_.*?(&|$)/igm, "");

You do not need to escape &

(?<=&|\?) = positive lookbehind

.*? = everything, but "not greedy"
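
A quick check of this pattern on a few inputs, as a sketch. The example.com URLs and the `clean` helper are made up for illustration, and the lookbehind needs an engine that supports it (ECMAScript 2018 or later).

```javascript
// Lookbehind-based pattern from this answer.
const rx = /(?<=&|\?)utm_.*?(&|$)/gim;

// Hypothetical wrapper so the pattern can be tried on several URLs.
const clean = (url) => url.replace(rx, "");

console.log(clean("https://example.com/p?utm_source=1&utm_medium=2&awc=3"));
// https://example.com/p?awc=3
console.log(clean("https://example.com/p?utm_source=1&utm_medium=2"));
// https://example.com/p?
console.log(clean("https://example.com/p?awc=3&utm_tt=78"));
// https://example.com/p?awc=3&
```

The last two calls show that a ? or & can be left at the end when the utm parameters are the last ones in the query string.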

Fallenhero
  • Very nice, and also more robust on parameter values! Thanks – benraay Jul 05 '18 at 10:07
  • Note it contains a lookbehind that is only supported by ECMAScript 2018 compatible JS environments. Also, it does not remove the trailing `?` if the query string only contains `utm` params. See [this demo](https://regex101.com/r/TCHjRU/1). `https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?utm=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249&utm=ewe` ends up as `https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?`. – Wiktor Stribiżew Jul 05 '18 at 10:08
  • Exactly, I didn't know that. My need is for Node.js 8, and it works fine with that version – benraay Jul 05 '18 at 11:09
  • @benraay Do you mean it is OK for you that [`https://www.some.fr/812?utm=6901&utm=ewe` becomes `https://www.some.fr/812?`](https://regex101.com/r/TCHjRU/2)? – Wiktor Stribiżew Jul 05 '18 at 11:51
  • @benraay Ok, I think I misinterpreted the goal then. You said that `https://www.some.fr/812?&&aws=true` is already OK, and I thought you wanted as cleanly-looking pattern as possible. – Wiktor Stribiżew Jul 05 '18 at 13:04
  • @WiktorStribiżew Yes, the first one is already OK. The purpose is to scrape information from that page after this cleaning, but the URL is visible to people: "?&&&" is a bit ugly, while a single "?" sign at the end looks OK – benraay Jul 06 '18 at 14:03
3

You may use a single regex compatible with all JS versions that will

  • match and capture a ? that is followed by one or more utm params, which are in turn followed by a non-utm param, and replace the match with $1 to restore the ?, since it is still needed
  • or, match a ? followed by one or more utm params when the query string contains no params other than utm ones (so $1 will be empty, and the ? will get removed)
  • or, just match any remaining utm params to remove them.

The regex will look like

.replace(/(\?)utm[^&]*(?:&utm[^&]*)*&(?=(?!utm[^\s&=]*=)[^\s&=]+=)|\?utm[^&]*(?:&utm[^&]*)*$|&utm[^&]*/gi, '$1')

See the regex demo

Details

  • (\?)utm[^&]*(?:&utm[^&]*)*&(?=(?!utm[^\s&=]*=)[^\s&=]+=) - ?utm (with ? inside a capturing group, later referenced with $1), 0+ chars other than &, then 0 or more repetitions of &utm followed by 0+ chars other than &, and then a & that is followed by a param name (1+ chars other than whitespace, & and =, then =) that does not start with utm
  • | - or
  • \?utm[^&]*(?:&utm[^&]*)*$ - ?utm, 0+ chars other than &, and then 0 or more repetitions of &utm followed with 0+ chars other than & and then the end of the string
  • | - or
  • &utm[^&]* - a &, utm and then 0+ chars other than &

JS demo:

var urls = [
  'https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?utm_source=325483&utm_medium=affiliation&utm_content=catalogue-RDC&awc=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249',
  'https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?t=55&utm_source=325483&utm_medium=affiliation&utm_content=catalogue-RDC&awc=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249',
  'https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?awc=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249&utm_tt=78',
  'https://www.somewebsite.fr/produit/yi-camera-3600-noir-vr-33705370/offre-81085802?utm=6901_1530705916_88ef12642ad61dfc5239ba01bbbe5249&utm=ewe'
];

var u = 'utm[^&]*';
// Inside a string literal, \s must be written as \\s; a bare "\s" becomes a plain "s".
var rx = new RegExp("(\\?)"+u+"(?:&"+u+")*&(?=(?!utm[^\\s&=]*=)[^\\s&=]+=)|\\?"+u+"(?:&"+u+")*$|&"+u, "ig");
for (var url of urls) {
  console.log(url, "=>", url.replace(rx, '$1'));
}
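
Since the pattern is assembled from strings, the parameter prefix can also be passed in as an argument. The following is only a sketch of that idea: the names `makeParamStripper` and `stripUtmParams` and the example.com URLs are hypothetical and were chosen for illustration.

```javascript
// Hypothetical helper: returns a function that strips every query
// parameter whose name starts with `prefix` (for example "utm").
function makeParamStripper(prefix) {
  // Escape regex metacharacters so the prefix is matched literally.
  var p = prefix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var param = p + "[^&]*";
  var rx = new RegExp(
    // "?" + prefixed params + "&" before a non-prefixed param: keep the "?"
    "(\\?)" + param + "(?:&" + param + ")*&(?=(?!" + p + "[^\\s&=]*=)[^\\s&=]+=)" +
    // "?" followed only by prefixed params: remove the "?" as well
    "|\\?" + param + "(?:&" + param + ")*$" +
    // any remaining "&" + prefixed param
    "|&" + param,
    "gi"
  );
  return function (url) { return url.replace(rx, "$1"); };
}

var stripUtmParams = makeParamStripper("utm");
console.log(stripUtmParams("https://example.com/p?utm_source=1&utm_medium=2&awc=3"));
// https://example.com/p?awc=3
console.log(stripUtmParams("https://example.com/p?t=55&utm_source=1&awc=3"));
// https://example.com/p?t=55&awc=3
console.log(stripUtmParams("https://example.com/p?utm=1&utm=ewe"));
// https://example.com/p
```

When the first alternative does not take part in a match, $1 is empty in JavaScript, which is why the same '$1' replacement works for all three alternatives.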
Wiktor Stribiżew
  • Really nice and detailed solution, but the readability is a bit lost. I prefer my first two-liner version; it's much easier to understand – benraay Jul 05 '18 at 11:05
  • @benraay The readability is lost, true, however, it embraces all possible scenarios and shows how to build the pattern dynamically. The solution can be tailored to remove other params easily, and it can be used with regex libraries that do not support lookbehinds, like VBA or JS versions that do not support ECMAScript 2018 standard. – Wiktor Stribiżew Jul 05 '18 at 11:14