I have a dataset of URLs that contain headlines, such as:
http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb
http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto
http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite
http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj
http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html
http://www.stack.com/2013/11/13/tech/the-good-one.html
http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14
I need to extract the proper headline from these kinds of links, that is:
- this-is-a-very-nice-headline-my-friend
- another-very-nice
- hello-another-one-here
- hello-one-here-that-is-cool
- the-real-one
- the-good-one
- hello-world-here-is-a-weird-character
So the rule seems to be: find the longest string of the form word1-word2-word3 that
- has a / at its left or right border,
- ignores words with more than 3 digits (for instance acjhrjk-2e1-1krjke4-9el8c-2eheje in the first link, or 54216 in the third one),
- excludes stuff like .html.
How can I do that using regex in Python? Unfortunately, I believe regex is the only viable solution here. Packages such as yurl or urlparse can capture the path of the URL, but then I am back to using regex to get the headline.
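To make the rule concrete, here is a rough sketch of the path-then-regex approach, assuming Python 3 (where urlparse lives in urllib.parse). It is only one possible reading of the rule, not a definitive solution: the function name extract_headline is made up for illustration, and it simplifies "words with more than 3 digits" to "any token that is not purely alphabetic", which happens to be enough for the sample links above. "Longest" is measured in number of words, not characters.

```python
import re
from urllib.parse import urlsplit

# Assumption: a "word" is purely alphabetic, so any token containing a digit
# or an underscore ends the current run of words.
WORD = re.compile(r"[A-Za-z]+")
# Assumption: a trailing ".something" on a path segment is a file extension.
EXTENSION = re.compile(r"\.\w+$")


def extract_headline(url):
    """Return the run of hyphen-separated alphabetic words with the most words."""
    path = urlsplit(url).path  # drops the ?query and #fragment parts
    best = []
    for segment in path.split("/"):
        segment = EXTENSION.sub("", segment)  # strip a trailing .html and similar
        run = []
        # The empty string at the end is a sentinel that flushes the last run.
        for token in segment.split("-") + [""]:
            if WORD.fullmatch(token):
                run.append(token)
            else:
                if len(run) > len(best):  # strict ">" keeps the earliest run on ties
                    best = run
                run = []
    return "-".join(best)


print(extract_headline(
    "http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html"
))  # the-real-one
print(extract_headline(
    "http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite"
))  # hello-another-one-here
```

This would break on headlines that legitimately contain a number (for example top-10-tips), so the word test would need loosening for real data.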
Many thanks!