I have several thousand URLs, and the task is to count their occurrences and print the most frequent ones. The problem arises when a single article has multiple URLs. Example below:
http://mashable.com/2013/06/05/whistle/?utm_campaign=Feed:+Mashable+(Mashable)&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=twitter&utm_source=twitterfeed
http://mashable.com/2013/06/05/whistle/?utm_campaign=Feed:+Mashable+(Mashable)&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=feed&utm_source=feedburner
http://mashable.com/2013/06/05/whistle/?utm_campaign=Mash-Product-RSS-Pheedo-All-Partial&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=twitter&utm_source=dlvr.it
All of these point to the same article; they differ only in some third-party tracking parameters. I can strip the parameters listed below with a regex, but there could be an unlimited number of variants. Also, I cannot drop the entire query string, because it may contain genuine parameters (e.g. show.php?p=12).
utm_campaign
utm_cid
utm_medium
utm_source
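For reference, here is a minimal sketch of the blocklist approach described above, using Python's standard urllib.parse instead of a regex. The names TRACKING_PARAMS, canonicalize and top_urls are made up for illustration, and the blocklist is an assumption that covers only the four parameters listed here, so it is not a complete solution.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed blocklist: only the four parameters seen in the example URLs.
TRACKING_PARAMS = {"utm_campaign", "utm_cid", "utm_medium", "utm_source"}

def canonicalize(url):
    """Return the URL with blocklisted query parameters removed."""
    parts = urlsplit(url)
    # Keep every parameter that is not on the blocklist (e.g. p=12).
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

def top_urls(urls, n=10):
    """Count canonicalized URLs and return the n most frequent."""
    return Counter(canonicalize(u) for u in urls).most_common(n)
```

With this, the three example URLs collapse to one entry, while a URL such as show.php?p=12 keeps its p parameter. Note that urlencode re-encodes the surviving values, so the output may not match the input byte for byte.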
Question: Is there a comprehensive list of these tracking parameters? Has anyone done this in the past with a better approach?