How would you clean tracking query string variables out of a URL?

Question

I have several thousand URLs, and the task is to count their occurrences and print the most frequent ones. The problem arises when a single article has multiple URLs. Example below:

http://mashable.com/2013/06/05/whistle/?utm_campaign=Feed:+Mashable+(Mashable)&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=twitter&utm_source=twitterfeed
http://mashable.com/2013/06/05/whistle/?utm_campaign=Feed:+Mashable+(Mashable)&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=feed&utm_source=feedburner
http://mashable.com/2013/06/05/whistle/?utm_campaign=Mash-Product-RSS-Pheedo-All-Partial&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=twitter&utm_source=dlvr.it

All of these point to the same article; they differ only in some third-party tracking variables. I can eliminate the following using a RegExp, but there could be unlimited variants. Also, I cannot drop the entire query string, as it could contain a genuine variable (e.g. show.php?p=12).

utm_campaign
utm_cid
utm_medium
utm_source
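For illustration, here is a minimal sketch of removing just these four variables with Python's standard library instead of a RegExp, then counting the cleaned URLs. The names `TRACKING_PARAMS`, `strip_tracking` and `top_urls` are hypothetical, and the set holds only the variables listed above, not a comprehensive list.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only the four variables from the question; extend as needed.
TRACKING_PARAMS = {"utm_campaign", "utm_cid", "utm_medium", "utm_source"}


def strip_tracking(url, tracking=TRACKING_PARAMS):
    """Return url with the listed query variables removed, others kept."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in tracking
    ]
    # urlencode re-encodes the kept values, so their escaping may be normalized.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def top_urls(urls, n=10):
    """Count cleaned URLs and return the n most frequent as (url, count) pairs."""
    return Counter(strip_tracking(u) for u in urls).most_common(n)
```

With this, the three Mashable URLs above would be counted as one article, while a genuine variable such as the `p` in show.php?p=12 is left in place.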

Question: Is there a comprehensive list of these variables? Have you done this in the past with a better approach?

OK, but the HTTP header of that article must be the same, right? You can get it and then check it against the others. – anshulkatta, Jun 06 '13 at 08:51

score 0 · Answer 1 · answered Jun 06 '13 at 08:51

You can also do this using a RegExp:

  [?&](.*?)=

In a URL, every query variable name starts after a '?' or '&' and ends at the following '='.
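A minimal sketch of how this pattern might be applied in Python to list the variable names found in a set of URLs, so that frequently occurring tracking names stand out. The function names `param_names` and `name_frequencies` are hypothetical. One caveat, as an assumption about the input: a variable without a value (e.g. `?flag&x=1`) would be captured together with the next name, so the pattern is best suited to URLs where every variable has an '='.

```python
import re
from collections import Counter

# The pattern from the answer: capture the text between '?' or '&' and the next '='.
PARAM_NAME = re.compile(r"[?&](.*?)=")


def param_names(url):
    """Return the query variable names found in url, in order of appearance."""
    return PARAM_NAME.findall(url)


def name_frequencies(urls):
    """Count how often each variable name occurs across all urls."""
    counts = Counter()
    for url in urls:
        counts.update(param_names(url))
    return counts
```

Sorting the result with `name_frequencies(urls).most_common()` gives a candidate list to review by hand; it does not decide by itself which names are tracking variables.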

nomaka

Yes, I know that :). Is there any comprehensive list of tracking cookie names? – Ankit Jain Jun 06 '13 at 08:54
See the example above in the question: three examples from Mashable. – Ankit Jain Jun 06 '13 at 09:05
