Say you have a webserver log (apache, nginx, whatever). From it you extract a large list of URLs:
/article/1/view
/article/2/view
/article/1/view
/article/1323/view
/article/1/edit
/help
/article/1/view
/contact
/contact/thank-you
/article/8/edit
...
or
/blog/2012/06/01/how-i-will-spend-my-summer-vacation
/blog/2012/08/30/how-i-wasted-my-summer-vacation
...
You explode these urls into their pieces such that you have ['article', '1323', 'view'] or ['blog', '2012', '08', '30', 'how-i-wasted-my-summer-vacation'].
How would one go about analyzing and comparing these urls to detect and call out "variables" in the url path. That is to say, you would want to recognize things like /article/XXX/view
, /article/XXX/edit
, and /blog/XXX/XXX/XXX/XXX
such that you can summarize information about those lines in the logs.
I assume that there will need to be some statistical threshold for the number of differences that constitute a mutable piece vs a similar looking but different template. I am also unsure as to what data structure would make this analysis quick and easy.
I would like the output of the script to output what it thinks are all the url templates that are present on the server, possibly with some confidence value if appropriate.