Consider the sentence:
msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'
Next, I process the sentence using out-of-the-box spaCy for English:
import spacy
nlp = spacy.load('en')
doc = nlp(msg)
Let's review the output of [(t, t.lemma_, t.pos_, t.tag_, t.dep_) for t in doc]:
[(I, '-PRON-', 'PRON', 'PRP', 'nsubj'),
(got, 'get', 'VERB', 'VBD', 'ROOT'),
(this, 'this', 'DET', 'DT', 'det'),
(URL, 'url', 'NOUN', 'NN', 'compound'),
(https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293,
'https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293',
'NOUN',
'NN',
'nsubj'),
(freed, 'free', 'VERB', 'VBN', 'ccomp')]
I would like to improve the handling of the URL piece. In particular, I want to:
- Set its lemma to stackoverflow.com
- Set the tag to URL
How can I do this using spaCy? I want to use a regex (as suggested here) to decide whether a string is a URL and, if so, to extract its domain. So far, I have failed to find a way to do it.
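To make the regex part concrete, here is a minimal sketch of the kind of check I have in mind. The pattern and the names URL_RE and url_domain are my own placeholders for illustration, not taken from the linked suggestion, and the pattern only covers plain http(s) URLs. It does not touch spaCy at all; it only shows the URL test and the domain extraction.

```python
import re

# Hypothetical pattern (an assumption, not the one from the linked answer):
# an http(s) scheme, an optional "www." prefix, then the host captured as
# "domain", ending at the first "/", ":", "?", "#" or whitespace.
URL_RE = re.compile(r'^https?://(?:www\.)?(?P<domain>[^/:?#\s]+)', re.IGNORECASE)


def url_domain(text):
    """Return the lower-cased domain if text looks like an http(s) URL, else None."""
    match = URL_RE.match(text)
    return match.group('domain').lower() if match else None


url = ('https://stackoverflow.com/questions/47637005/'
       'handmade-estimator-modifies-parameters-in-init/47637293'
       '?noredirect=1#comment82268544_47637293')

print(url_domain(url))      # domain of the URL token from the sentence above
print(url_domain('freed'))  # not a URL, so None
```

What I am missing is how to apply a function like this to each token inside the spaCy pipeline.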
EDIT: I guess a custom component is what I need. However, there seems to be no way to use a regex-based (or any other) callable as the patterns.