I am trying to identify new patterns after analyzing a number of URLs. So let's say, I am investigating the hypothetical website Yoohle.com and their URLs have the following structure.
- domain = yoohle.com
- q= search phrase
- lan= language used
- pr= partner_id
- br= browser_id
so a sample url will look like this
www.yoohle.com/test_folder/test_page?q=hello+world&lan=en&pr=stackoverflow&br=chrome
If I am investigating the web traffic of this website and seeing abnormal increase month over month, I would like to find out what's causing this. In this example I can just parse out the URL and look at the pr= value since it will tell me if there is a new partnership (maybe stackoverflow is going to be powered by yoohle.com and that drives the increase etc.)
The question is, how can I build something robust that can compare 2 (or more) months and tell me exactly what's driving the increase. I want to get something like, "we are seeing an increase and it is driven by the following pattern"
www.yoohle.com/test_folder/test_page%pr=stackoverflow%
The tricky part is, you do not know anything about what the tokens mean unlike this example since I will not know what token stands for partner_id. Another issue is, if we look at token by token, this will be misleading because lan=en will also go up with a new partner assuming the users will still have English as the language.
My idea is to analyze the tokens by looking at all the combinations but it is very costly, (4! in this example and probably 10+! for other websites). Also analyzing tokens itself is not going to solve the problem since I still need to analyze the values of the tokens.
I tried k-means clustering, apriori algorithm did some research on URL/text mining but could not get what I want. Any ideas about how to approach building an algorithm will be beneficial.
Imagine that you are seeing realtime data, so we are talking about analyzing around 100K URLs in a given month.