
Consider the sentence

msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'

Next, I process the sentence using out-of-the-box spaCy for English:

import spacy
nlp = spacy.load('en')
doc = nlp(msg)

Let's review the output of `[(t, t.lemma_, t.pos_, t.tag_, t.dep_) for t in doc]`:

[(I, '-PRON-', 'PRON', 'PRP', 'nsubj'),
 (got, 'get', 'VERB', 'VBD', 'ROOT'),
 (this, 'this', 'DET', 'DT', 'det'),
 (URL, 'url', 'NOUN', 'NN', 'compound'),
 (https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293,
  'https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293',
  'NOUN',
  'NN',
  'nsubj'),
 (freed, 'free', 'VERB', 'VBN', 'ccomp')]

I would like to improve the handling of the URL piece. In particular, I want to:

  1. Set its lemma to stackoverflow.com
  2. Set the tag to URL

How can I do this using spaCy? I want to use a regex (as suggested here) to decide whether a string is a URL and to extract the domain. So far, I have failed to find a way to do it.
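
Outside of spaCy, the detection-and-domain part is easy enough; here is a rough sketch of what I mean (the regex is simplified and the `domain_of` helper is just for illustration):

import re

# Illustrative only: a scheme followed by a netloc; real URL validation needs more care
URL_RE = re.compile(r'^https?://([^/\s]+)')

def domain_of(text):
    match = URL_RE.match(text)
    return match.group(1) if match else None

print(domain_of('https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293'))
# stackoverflow.com

What I cannot figure out is how to hook this into spaCy.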

EDIT: I guess a custom component is what I need. However, it seems there is no way to supply a regex-based (or any other) callable as the patterns.
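
The closest I got is matching on the built-in `LIKE_URL` flag, which covers detection but offers no hook for my own regex or for rewriting the lemma (a sketch against spaCy 2.x's `Matcher`):

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# LIKE_URL is a built-in boolean token flag; I see no way to plug a custom regex in here
matcher.add('URL', None, [{'LIKE_URL': True}])

doc = nlp(msg)
for match_id, start, end in matcher(doc):
    print(doc[start:end])  # the URL span; its tag and lemma stay untouched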

  • I'd suggest using [`urlparse.urlsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlsplit) for URL handling and parsing. What you call *lemma* here would be stored in the result's `netloc` attribute. – Tomáš Linhart Jan 05 '18 at 11:16
  • @TomášLinhart Thanks for the pointer. Still, I don't understand how to enrich the `Doc` type yielded by `spaCy`. – Dror Jan 05 '18 at 12:31
  • Either create a wrapper which includes URL parsing, or do a second pass for URL parsing. spaCy's English models are trained on human language; URL specifications are artificial and can be parsed efficiently with other libraries. – Nathan McCoy Jan 06 '18 at 09:55

1 Answer


Customized Regex for URL

You can specify a URL regex in a customized tokenizer, following the example at https://spacy.io/usage/linguistic-features#native-tokenizers:

import spacy
import regex as re  # the third-party 'regex' package; the stdlib 're' also works here
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')       # punctuation split off at the start of a token
suffix_re = re.compile(r'''[\]\)"']$''')       # punctuation split off at the end of a token
infix_re = re.compile(r'''[-~]''')             # characters that split a token internally
simple_url_re = re.compile(r'''^https?://''')  # token_match: keep anything matching this as one token

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=simple_url_re.match)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'

for i, token in enumerate(nlp(msg)):
    print(i, ':\t', token)

[out]:

0 :  I
1 :  got
2 :  this
3 :  URL
4 :  https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293
5 :  freed

Check if token is URL

You can check whether a token looks like a URL via the built-in `like_url` attribute, e.g.

for i, token in enumerate(nlp(msg)):
    print(token.like_url, ':\t', token.lemma_)

[out]:

False :  -PRON-
False :  get
False :  this
False :  url
True :   https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293
False :  free

Change tag if LIKE_URL

doc = nlp(msg)

for i, token in enumerate(doc):
    if token.like_url:
        token.tag_ = 'URL'

print([token.tag_ for token in doc])

[out]:

['PRP', 'VBD', 'DT', 'NN', 'URL', 'VBN']
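
As an aside, if you would rather not overwrite the fine-grained tag, spaCy 2.0's custom extension attributes can carry the same information on the token; the attribute name `url_domain` below is made up for the example:

from spacy.tokens import Token

# Register once; afterwards every token exposes token._.url_domain (default None)
Token.set_extension('url_domain', default=None)

for token in doc:
    if token.like_url:
        # In practice, extract the domain from token.text instead of hard-coding it
        token._.url_domain = 'stackoverflow.com'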

Replace URL's lemma with customized lemma

Using the regex from https://regex101.com/r/KfjQ1G/1:

doc = nlp(msg)

for i, token in enumerate(doc):
    # 's' made optional and the dot escaped, so both http:// and https:// match
    if re.match(r'(?:https?://)stackoverflow\.com.*', token.lemma_):
        token.lemma_ = 'stackoverflow.com'

print([token.lemma_ for token in doc])

[out]:

['-PRON-', 'get', 'this', 'url', 'stackoverflow.com', 'free']
  • The first part, where you define a custom tokenizer, is great! However, the rest is rather manual and doesn't integrate into `spaCy`'s pipeline. How can I customize the pipeline such that the URL instances will be handled using rules based on `regex`? – Dror Jan 08 '18 at 10:41
  • Maybe you have to override `nlp.lemmatizer` and `nlp.tagger` (after `nlp = spacy.load('en')`)? – alvas Jan 08 '18 at 15:51
  • I guess so, but I don't know how. – Dror Jan 08 '18 at 15:54
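
A sketch of what that could look like, assuming spaCy 2.x custom pipeline components (a plain function registered with `nlp.add_pipe`; the component name `url_component` and the domain regex are illustrative, not part of spaCy's API):

import re
import spacy

DOMAIN_RE = re.compile(r'^https?://([^/\s]+)')

def url_component(doc):
    # Runs on every Doc: retag URL tokens and replace their lemma with the domain
    for token in doc:
        match = DOMAIN_RE.match(token.text)
        if match:
            token.tag_ = 'URL'
            token.lemma_ = match.group(1)  # e.g. 'stackoverflow.com'
    return doc

nlp = spacy.load('en')
nlp.add_pipe(url_component, last=True)  # run after the tagger so the tag assignment sticks

doc = nlp(msg)  # msg as defined in the question
print([(token.lemma_, token.tag_) for token in doc])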