Add link to every word, accounting for punctuation, contractions, and Unicode

Question

I'd like to add a link to every word in a text.

Example text:
"He's certain in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.

Desired result:

"<a href='xxx.com?word=he'>He</a>'s
 <i><a href='xxx.com?word=certain'>certain</a></i>
 <a href='xxx.com?word=in'>in</a>
 <a href='xxx.com?word=america'>America</a>'s
 “<a href='xxx.com?word=west'>West</a>,”
 <a href='xxx.com?word=it'>it</a>
 <a href='xxx.com?word=could'>could</a>'ve
.... etc

(I split the output into multiple lines to make it easier to read here. The actual output should be all one string, e.g.:

 "<a href='xxx.com?word=he'>He</a>'s <i><a href='xxx.com?word=certain'>certain</a></i> <a href='xxx.com?word=in'>in</a> <a href='xxx.com?word=america'>America</a>'s “<a href='xxx.com?word=west'>West</a>,” <a href='xxx.com?word=it'>it</a> <a href='xxx.com?word=could'>could</a>'ve ... etc

Each word should have a link which is the word itself stripped of punctuation and contractions. Links are lower case. Punctuation and contractions shouldn't get links. Words and punctuation are utf-8 with many Unicode characters. The only html element it will encounter is and, so it's not html parsing, just that one tag pair. The link should be on the word inside the <--> tags.

My code below worked for simple test cases, but it has problems for real texts which are longer and have repeating words and  tags:

# -*- coding: utf-8 -*-
import re

def addLinks(s):
    #adds a link to dictionary for every word in text
    link = "xxx.com?word="

    #strip out 's, 'd, 'l, 'm, 've, 're
    #then split on punctuation
    words = filter(None, re.split("[, \-!?:_;\"“”‘’‹›«»]+",  re.sub("'[(s|d|l|m|(ve)|(re)]? ", " ", s)))
    for w in words:
        linkedWord = "<a href=#'" + link + w.lower() + "'>" + w + "</a>"
        s = s.replace(w,linkedWord,1)
    return s

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
"""
print addLinks(s)

My problems:

How to deal with words repeated in a sentence, either exact repetitions ("in"<->"in"), or with punctuation and/or capitalization("He's"<->"he"), or partial words ("gun"<->"gunfight", "any"<->"anywhere,"). It'd be easier if it were split on spaces exactly, but by stripping contractions and then splitting on punctuation, I can't figure out how to cleanly substitute the linked words back into the string.
My regex to get rid of contractions works for single letters like 'm and 'd, but doesn't work for 've and 're.
I can't figure out how to deal with  tags, for example to make certain into <a href="xxx.com?word=certain">certain</a>

I'm doing this in Python 2.7, but this answer for javascript is similar and works with Unicode, but doesn't account for my issues like punctuation.

How does your code not deal with capitalization and repeated words? (I.e., what do you get now?) At a glance, a simple substitution such as this `re.sub(r"(?!i>)(\w+)", r"\1", s)` should work just nice. — Jongware, May 10 '16 at 08:29
Natural language has a lot of demons, don't reinvent the wheel. Look into using [Python's Natural Language Toolkit](http://www.nltk.org/). — machine yearning, May 10 '16 at 08:54
@RadLexus: My repetition problem is because I was using a loop to replace each found word, so it found "gun" in "gunfight", which is not what I want. Thanks for your clever idea: it works pretty well, but it makes links to contractions like 've, 'd; as in the example, I'm trying to not have links on them. Also, the href link has to be all-lowercase of the word, while the text stays exactly as it is everywhere. — S.Yarowat, May 10 '16 at 09:33

score 1 · Accepted Answer · edited May 23 '17 at 10:33

Regular expressions can help you out.

To match words, of any length, you can use \w+. To ignore the single tags  and , you can add a lookahead: (?!>). This will match both the open and close tags. Finally, to ignore the right hand side of contractions, you can add a lookbehind before the match proper: (?<!').

To insert a lowercase version of the found pattern, use a callback function (from Using a regular expression to replace upper case repeated letters in python with a single lowercase letter). The callback lambda function inserts the lowercase version of the found match, surrounded by the <a= codes, and constructs the entire replacement string at once.

That leads us to

import re

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights
to erupt at any time anywhere," he said holding a gun in his hand.
"""

callback = lambda pat: '<a href="xxx.com?word='+pat.group(1).lower()+'">'+pat.group(1)+'</a>'
result = re.sub(r"(?<!')(?!i>)(\w+)", callback, s)

where result will end up as

"<a href="xxx.com?word=i">I</a>'m <i><a href="xxx.com?word=certain">
certain</a></i> <a href="xxx.com?word=in">in</a> <a href="xxx.com?
word=america">America</a>'s "<a href="xxx.com?word=west">West</a>," ...

Wow, very nice! Thanks for the clever code, regex genie :). And thanks for the explanation, I learned a lot. — S.Yarowat, May 10 '16 at 11:03

Add link to every word, accounting for punctuation, contractions, and Unicode

1 Answers1