-1

I have several thousand URLs, and the task is to count their occurrences and print the most frequent ones. The problem arises when a single article has multiple URLs. Example below:

http://mashable.com/2013/06/05/whistle/?utm_campaign=Feed:+Mashable+(Mashable)&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=twitter&utm_source=twitterfeed
http://mashable.com/2013/06/05/whistle/?utm_campaign=Feed:+Mashable+(Mashable)&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=feed&utm_source=feedburner
http://mashable.com/2013/06/05/whistle/?utm_campaign=Mash-Product-RSS-Pheedo-All-Partial&utm_cid=Mash-Product-RSS-Pheedo-All-Partial&utm_medium=twitter&utm_source=dlvr.it

All of these point to the same article; they differ only in some third-party tracking variables. I can eliminate the following using RegExp, but there could be unlimited variants. Also, I cannot drop the entire query string, as it could contain a genuine variable (e.g. show.php?p=12).

utm_campaign
utm_cid
utm_medium
utm_source

Question: Is there a comprehensive list of these variables? Have you done this in the past with a better approach?

Ankit Jain
  • 170
  • 4
  • OK, but the HTTP header of that article must be the same, right? You could fetch it and compare it with the others. – anshulkatta Jun 06 '13 at 08:51

1 Answer

0

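You can also use RegExp to list the variable names:

  [?&]([^=&]+)=

In a URL, every variable name starts after '?' or '&' and ends at '='. The character class keeps the match from running across '&' when a variable has no value.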
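
A minimal sketch in Python of how this could be applied to the counting problem in the question. It is not a definitive solution: the function names (param_names, normalize, top_urls) are made up for illustration, and it assumes that every variable whose name starts with "utm_" is tracking noise, which is a guess based on the four names listed in the question rather than a comprehensive list.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl, urlencode

# Assumption: names starting with these prefixes are tracking variables.
# Extend the tuple as new variants show up in your data.
TRACKING_PREFIXES = ("utm_",)

# Variable names sit between '?' or '&' and the following '='.
# This is a tightened variant of the pattern above that never crosses '&' or '#'.
PARAM_NAME = re.compile(r"[?&]([^=&#]+)=")


def param_names(url):
    """List the query variable names found in a URL, in order."""
    return PARAM_NAME.findall(url)


def normalize(url):
    """Return the URL with tracking variables removed and the rest kept."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(TRACKING_PREFIXES)
    ]
    # urlencode re-encodes the kept values, so their spelling may change
    # slightly, but it changes the same way for every URL.
    return parts._replace(query=urlencode(kept)).geturl()


def top_urls(urls, n=10):
    """Count normalized URLs and return the n most frequent."""
    return Counter(normalize(u) for u in urls).most_common(n)
```

Running param_names over the whole data set and counting the names is a cheap way to discover which tracking variables actually occur, since no comprehensive list is given here. Genuine variables such as p in show.php?p=12 survive normalize because they do not match the prefix.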

nomaka
  • 58
  • 5