1

I'd like to add a link to every word in a text.

Example text:
"He's <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.

Desired result:

"<a href='xxx.com?word=he'>He</a>'s
 <i><a href='xxx.com?word=certain'>certain</a></i>
 <a href='xxx.com?word=in'>in</a>
 <a href='xxx.com?word=america'>America</a>'s
 “<a href='xxx.com?word=west'>West</a>,”
 <a href='xxx.com?word=it'>it</a>
 <a href='xxx.com?word=could'>could</a>'ve
.... etc

(I split the output into multiple lines to make it easier to read here. The actual output should be all one string, e.g.:

 "<a href='xxx.com?word=he'>He</a>'s <i><a href='xxx.com?word=certain'>certain</a></i> <a href='xxx.com?word=in'>in</a> <a href='xxx.com?word=america'>America</a>'s “<a href='xxx.com?word=west'>West</a>,” <a href='xxx.com?word=it'>it</a> <a href='xxx.com?word=could'>could</a>'ve ... etc

Each word should have a link which is the word itself stripped of punctuation and contractions. Links are lower case. Punctuation and contractions shouldn't get links. Words and punctuation are utf-8 with many Unicode characters. The only html element it will encounter is <i>and</i>, so it's not html parsing, just that one tag pair. The link should be on the word inside the <i><--></i> tags.

My code below worked for simple test cases, but it has problems for real texts which are longer and have repeating words and <i> tags:

# -*- coding: utf-8 -*-
import re

def addLinks(s):
    #adds a link to dictionary for every word in text
    link = "xxx.com?word="

    #strip out 's, 'd, 'l, 'm, 've, 're
    #then split on punctuation
    words = filter(None, re.split("[, \-!?:_;\"“”‘’‹›«»]+",  re.sub("'[(s|d|l|m|(ve)|(re)]? ", " ", s)))
    for w in words:
        linkedWord = "<a href=#'" + link + w.lower() + "'>" + w + "</a>"
        s = s.replace(w,linkedWord,1)
    return s

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
"""
print addLinks(s)

My problems:

  • How to deal with words repeated in a sentence, either exact repetitions ("in"<->"in"), or with punctuation and/or capitalization("He's"<->"he"), or partial words ("gun"<->"gunfight", "any"<->"anywhere,"). It'd be easier if it were split on spaces exactly, but by stripping contractions and then splitting on punctuation, I can't figure out how to cleanly substitute the linked words back into the string.
  • My regex to get rid of contractions works for single letters like 'm and 'd, but doesn't work for 've and 're.
  • I can't figure out how to deal with <i> tags, for example to make <i>certain</i> into <i><a href="xxx.com?word=certain">certain</a></i>

I'm doing this in Python 2.7, but this answer for javascript is similar and works with Unicode, but doesn't account for my issues like punctuation.

Community
  • 1
  • 1
S.Yarowat
  • 13
  • 3
  • How does your code not deal with capitalization and repeated words? (I.e., what do you get now?) At a glance, a simple substitution such as this `re.sub(r"(?!i>)(\w+)", r"\1", s)` should work just nice. – Jongware May 10 '16 at 08:29
  • Natural language has a lot of demons, don't reinvent the wheel. Look into using [Python's Natural Language Toolkit](http://www.nltk.org/). – machine yearning May 10 '16 at 08:54
  • @RadLexus: My repetition problem is because I was using a loop to replace each found word, so it found "gun" in "gunfight", which is not what I want. Thanks for your clever idea: it works pretty well, but it makes links to contractions like 've, 'd; as in the example, I'm trying to not have links on them. Also, the href link has to be all-lowercase of the word, while the text stays exactly as it is everywhere. – S.Yarowat May 10 '16 at 09:33

1 Answers1

1

Regular expressions can help you out.

To match words, of any length, you can use \w+. To ignore the single tags <i> and </i>, you can add a lookahead: (?!>). This will match both the open and close tags. Finally, to ignore the right hand side of contractions, you can add a lookbehind before the match proper: (?<!').

To insert a lowercase version of the found pattern, use a callback function (from Using a regular expression to replace upper case repeated letters in python with a single lowercase letter). The callback lambda function inserts the lowercase version of the found match, surrounded by the <a= codes, and constructs the entire replacement string at once.

That leads us to

import re

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights
to erupt at any time anywhere," he said holding a gun in his hand.
"""

callback = lambda pat: '<a href="xxx.com?word='+pat.group(1).lower()+'">'+pat.group(1)+'</a>'
result = re.sub(r"(?<!')(?!i>)(\w+)", callback, s)

where result will end up as

"<a href="xxx.com?word=i">I</a>'m <i><a href="xxx.com?word=certain">
certain</a></i> <a href="xxx.com?word=in">in</a> <a href="xxx.com?
word=america">America</a>'s "<a href="xxx.com?word=west">West</a>," ...
Community
  • 1
  • 1
Jongware
  • 22,200
  • 8
  • 54
  • 100
  • Wow, very nice! Thanks for the clever code, regex genie :). And thanks for the explanation, I learned a lot. – S.Yarowat May 10 '16 at 11:03