How to extract a headline from a URL?

Question

I have a dataset of URLs that contain headlines, such as

http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb

http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto

http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite

http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj

http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html

http://www.stack.com/2013/11/13/tech/the-good-one.html

http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14

From these kinds of links I need to extract the proper headline, that is:

this-is-a-very-nice-headline-my-friend
another-very-nice
hello-another-one-here
hello-one-here-that-is-cool
the-real-one
the-good-one
hello-world-here-is-a-weird-character

So the rule seems to be: find the longest string of the form word1-word2-word3-... that has a / on its left or right border, without considering words with more than 3 digits (for instance acjhrjk-2e1-1krjke4-9el8c-2eheje in the first link, or 54216 in the third one) and excluding stuff like .html.

How can I do that using regex in Python? Unfortunately, I believe regex is the only viable solution here. Packages such as yurl or urlparse can capture the path of the URL, but then I am back to using regex to get the headline.

Many thanks!

Jan · Accepted Answer · 2016-06-22T21:06:12.133

After all, regular expressions might not be your best bet.
However, with the specifications you came up with, you could do the following:

import re

urls = ['http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb',
'http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto',
'http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite',
'http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj',
'http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html',
'http://www.stack.com/2013/11/13/tech/the-good-one.html',
'http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14']

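# Capture runs of word characters and dashes that follow a "/" and are
# followed by one of . ? / # or by the end of the string.
regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')
# Runs of three or more digits, together with an adjacent dash on either side.
digits = re.compile(r'-?\d{3,}-?')

for url in urls:
    substrings = regex.findall(url)
    longest = max(substrings, key=len)   # longest candidate segment
    headline = digits.sub('', longest)   # strip the digit runs
    print(headline)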

This will print

 this-is-a-very-nice-headline-my-friend
 another-very-nice
 hello-another-one-here
 hello-one-here-that-is-coollMyQjAxMTAHFJELMDgxWj
 the-real-one
 the-good-one
 hello-world-here-is-a-weird-character

See a demo on ideone.com.
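
Note that the fourth line of the output still carries the trailing ID (lMyQjAxMTAHFJELMDgxWj), and that the first URL only works because max returns the first of two equally long candidates. One possible refinement, which is not part of the original answer, is to keep only the dash-separated words that consist purely of lower-case letters before comparing lengths. This is a minimal sketch under the assumption that real headline words never contain digits, underscores or upper-case letters; the helper names plain_words and extract_headline are made up for illustration.

```python
import re

# Same candidate pattern as in the answer above.
SEGMENT = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')


def plain_words(segment):
    """Keep only the dash-separated parts made purely of lower-case letters.

    Assumption: IDs and dates contain digits, underscores or upper-case
    letters, while headline words do not.
    """
    return [w for w in segment.split('-') if w.isalpha() and w.islower()]


def extract_headline(url):
    """Return the longest candidate segment after filtering its words."""
    candidates = ['-'.join(plain_words(s)) for s in SEGMENT.findall(url)]
    return max(candidates, key=len, default='')


url = ('http://www.stack.com/article_email/'
       'hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj')
print(extract_headline(url))  # hello-one-here-that-is-cool
```

The obvious limitation is that a headline with a legitimate number in it (say, a year) would lose that word, so the filter would need to be loosened for such data.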

Explanation

Here, the regex uses lookarounds to require a / behind the match and one of the characters . ? / # (or the end of the string) ahead of it. Any run of word characters and dashes in between is captured.
This is not very specific, but if you are looking for the longest substring and remove runs of three or more consecutive digits afterwards, it might be a good starting point.
As already said in the comments, you might perhaps be better off using linguistic tools.
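
To make the lookarounds more concrete, here is a small sketch of what the pattern captures for the fifth URL from the question. Each captured piece sits directly after a / and directly before a single character out of . ? / # (or the end of the string); the longest piece is then taken as the headline candidate.

```python
import re

regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')

url = 'http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html'

# Every run of word characters and dashes between a "/" and one of . ? / #
segments = regex.findall(url)
print(segments)
# ['www', '2013', '11', '13', 'tech', 'tricky-one', 'the-real-one', 'index']

# The longest captured segment is the headline candidate.
print(max(segments, key=len))  # the-real-one
```

Note that 'stack', 'com' and 'html' are not captured because they are preceded by a dot rather than a slash.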

also, why are you saying that regex might not be my best bet? — ℕʘʘḆḽḘ, Jun 22 '16 at 17:55
@Noobie: yup. But never used it before. Probably a combination of both of them. To give a better answer, provide more URLs. — Jan, Jun 22 '16 at 18:09
thanks again for your great answer! It seems your regex fails for the fourth and the last one however... do you see any solution? for instance I get `science-and-technology` for the last one. Recall that you can have a `slash` on the left **OR** a `slash` on the right, not necessarily both — ℕʘʘḆḽḘ, Jun 22 '16 at 19:27
thats really great!! can you just explain what `.?/#` means? it looks for zero or more slashes? — ℕʘʘḆḽḘ, Jun 22 '16 at 19:38
@Noobie: Nope: it looks for one of them (it is a character class after all). — Jan, Jun 22 '16 at 19:40
