I'd like to add a link to every word in a text.
Example text:
"He's <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
Desired result:
"<a href='xxx.com?word=he'>He</a>'s
<i><a href='xxx.com?word=certain'>certain</a></i>
<a href='xxx.com?word=in'>in</a>
<a href='xxx.com?word=america'>America</a>'s
“<a href='xxx.com?word=west'>West</a>,”
<a href='xxx.com?word=it'>it</a>
<a href='xxx.com?word=could'>could</a>'ve
.... etc
I split the output into multiple lines to make it easier to read here. The actual output should be all one string, e.g.:
"<a href='xxx.com?word=he'>He</a>'s <i><a href='xxx.com?word=certain'>certain</a></i> <a href='xxx.com?word=in'>in</a> <a href='xxx.com?word=america'>America</a>'s “<a href='xxx.com?word=west'>West</a>,” <a href='xxx.com?word=it'>it</a> <a href='xxx.com?word=could'>could</a>'ve ... etc
Each word should have a link which is the word itself stripped of punctuation and contractions. Links are lower case. Punctuation and contractions shouldn't get links. Words and punctuation are utf-8 with many Unicode characters. The only html element it will encounter is <i> and </i>, so it's not html parsing, just that one tag pair. The link should be on the word inside the <i></i> tags.
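Read as a single left-to-right scan, these rules might be sketched as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: it assumes Python 3 (in Python 2.7 the text and pattern would have to be unicode strings compiled with re.UNICODE), it treats a "word" as a run of Unicode letters, and it treats any apostrophe plus letters directly after a letter as a contraction or possessive. The names LINK, TOKEN and add_links are hypothetical.

```python
import re

LINK = "xxx.com?word="  # link prefix taken from the question

# Alternatives are tried in this order at each position (names are hypothetical):
#   group 1: the <i> or </i> tag, copied through unchanged
#   group 2: a straight or curly apostrophe plus letters, directly after a letter
#            (assumed to be a contraction or possessive such as 's or ’ve)
#   group 3: a run of Unicode letters, i.e. a word that should get a link
TOKEN = re.compile(r"(</?i>)|((?<=[^\W\d_])['’][^\W\d_]+)|([^\W\d_]+)")

def add_links(text):
    """Wrap every word in a link; leave tags, punctuation and contractions alone."""
    def wrap(match):
        word = match.group(3)
        if word is None:
            # a tag or a contraction suffix: keep it exactly as it was
            return match.group(0)
        return "<a href='%s%s'>%s</a>" % (LINK, word.lower(), word)
    # re.sub never rescans text it has already replaced, so repeated words
    # and words inside other words are each handled exactly once
    return TOKEN.sub(wrap, text)

print(add_links("He's <i>certain</i> in America's “West,” it could’ve"))
```

Because the substitution happens at the position where each word was found, nothing has to be substituted back into the string afterwards. Hyphenated words become two links in this sketch, and words containing digits are not linked.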
My code below worked for simple test cases, but it has problems for real texts which are longer and have repeating words and <i> tags:
# -*- coding: utf-8 -*-
import re

def addLinks(s):
    # adds a link to dictionary for every word in text
    link = "xxx.com?word="
    # strip out 's, 'd, 'l, 'm, 've, 're
    # then split on punctuation
    words = filter(None, re.split("[, \-!?:_;\"“”‘’‹›«»]+", re.sub("'[(s|d|l|m|(ve)|(re)]? ", " ", s)))
    for w in words:
        linkedWord = "<a href=#'" + link + w.lower() + "'>" + w + "</a>"
        s = s.replace(w, linkedWord, 1)
    return s

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
"""
print addLinks(s)
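The trouble with the replace-first-occurrence loop can be reproduced in isolation. The snippet below is a reduced sketch of the same loop, not the code above verbatim: the helper naive_links is hypothetical and takes an already-split word list so that only the substitution step is exercised.

```python
def naive_links(text, words):
    # Same idea as the loop above: replace the first occurrence of each word.
    # naive_links is a hypothetical helper used only for this demonstration.
    for w in words:
        linked = "<a href='xxx.com?word=%s'>%s</a>" % (w.lower(), w)
        text = text.replace(w, linked, 1)
    return text

# The first "in" in the string is the tail of "certain", so the wrong text is linked.
partial = naive_links("certain in America", ["in"])
print(partial)

# Once "certain" is linked, the first "in" is inside the href that was just inserted.
nested = naive_links("certain in", ["certain", "in"])
print(nested)
```

str.replace has no notion of word boundaries or of which parts of the string are already markup, which is why both cases go wrong.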
My problems:
- How to deal with words repeated in a sentence: exact repetitions ("in" <-> "in"), repetitions that differ in punctuation and/or capitalization ("He's" <-> "he"), or words contained in other words ("gun" <-> "gunfights", "any" <-> "anywhere,"). It'd be easier if the text were split exactly on spaces, but because I strip contractions and then split on punctuation, I can't figure out how to cleanly substitute the linked words back into the string.
- My regex to get rid of contractions works for single letters like 'm and 'd, but doesn't work for 've and 're.
- I can't figure out how to deal with <i> tags, for example to make <i>certain</i> into <i><a href="xxx.com?word=certain">certain</a></i>
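On the contraction regex from the second point: the square brackets turn the whole expression into a character class, so it matches at most one character out of ( s | d l m v e ) r, followed by a space. That explains why 'm matches and 've does not. The sketch below contrasts it with a non-capturing group. The suffix list (including ll instead of l) and the names ORIGINAL, FIXED and strip_contractions are assumptions for illustration; the sketch is written for Python 3.

```python
import re

# Pattern from the question: [...] is a character class, so after the
# apostrophe it accepts at most ONE of the listed characters, then a space.
ORIGINAL = re.compile(r"'[(s|d|l|m|(ve)|(re)]? ")

# Hypothetical alternative: a non-capturing group of whole suffixes, a straight
# or curly apostrophe, and \b instead of a literal trailing space.
FIXED = re.compile(r"['’](?:s|d|ll|m|ve|re)\b")

def strip_contractions(text):
    # Remove the listed contraction and possessive suffixes (assumed list).
    return FIXED.sub("", text)

print(ORIGINAL.search("I'm certain") is not None)   # single-letter suffix matches
print(ORIGINAL.search("could've been") is None)     # two-letter suffix does not
print(strip_contractions("I'm sure it could’ve been America's"))
```

Suffixes that are not in the list, such as n't, are left untouched by this sketch.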
I'm doing this in Python 2.7. This answer for JavaScript is similar and works with Unicode, but it doesn't account for my issues like punctuation.