
With the emergence of new TLDs (.club, .jobs, etc.), what is the current best practice for extracting/parsing domains from text? My typical approach is regex; however, given that things like file names with extensions will trigger false positives, I need something more restrictive.

I noticed that even Google sometimes does not properly recognize whether I'm searching for a file name or want to go to a domain, so this appears to be a rather challenging problem. Machine learning could potentially be used to understand the context surrounding a string, but unless there is a library that already does this I won't bother getting too fancy.

One approach I'm considering is, after extracting candidates with a regex, checking them against http://data.iana.org/TLD/tlds-alpha-by-domain.txt, which holds the current list of TLDs, and using it as a filter. A rough sketch of what I have in mind is below. Any suggestions?
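Something along these lines (a minimal sketch in Python; the regex is deliberately loose, and the file `tlds-alpha-by-domain.txt` is assumed to have been downloaded locally beforehand):

```python
import re

# Candidate pattern: dot-separated labels ending in an alphabetic "extension".
# Deliberately loose -- it will also match file names like "report.docx".
CANDIDATE_RE = re.compile(
    r'\b((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63})\b',
    re.IGNORECASE,
)

def load_tlds(path="tlds-alpha-by-domain.txt"):
    """Load the IANA TLD list (one TLD per line, '#' comment on the first line)."""
    with open(path, encoding="ascii") as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}

def extract_domains(text, tlds):
    """Keep only candidates whose last label is a real TLD, filtering out file names."""
    found = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = match.group(1)
        if candidate.rsplit(".", 1)[-1].lower() in tlds:
            found.append(candidate)
    return found

if __name__ == "__main__":
    tlds = load_tlds()
    sample = "Visit example.club or mysite.jobs, but ignore report.docx and notes.txt."
    print(extract_domains(sample, tlds))  # ['example.club', 'mysite.jobs']
```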

Zed
I think your final approach is quite fruitful. You'll never get 100% recall, but your precision should be high. – Private May 03 '17 at 13:01

1 Answer


This is not an easy problem, and it depends on the context in which you need to extract the domain names and on the rate of false positives and negatives you can accept. You can indeed use the list of currently existing TLDs, but this list changes, so you need to make sure you are working with a recent enough copy of it (a refresh sketch follows).
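For example, a minimal sketch in Python of keeping a local copy reasonably fresh; the cache path and one-week maximum age are arbitrary choices for illustration, not a recommendation:

```python
import os
import time
import urllib.request

TLD_URL = "http://data.iana.org/TLD/tlds-alpha-by-domain.txt"
CACHE_PATH = "tlds-alpha-by-domain.txt"  # local cache location (assumption)
MAX_AGE = 7 * 24 * 3600                  # re-download after one week

def get_tlds():
    """Return the set of current TLDs, re-downloading the IANA list when the cache is stale."""
    stale = (not os.path.exists(CACHE_PATH)
             or time.time() - os.path.getmtime(CACHE_PATH) > MAX_AGE)
    if stale:
        urllib.request.urlretrieve(TLD_URL, CACHE_PATH)
    with open(CACHE_PATH, encoding="ascii") as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}
```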

You are hitting issues covered by the Universal Acceptance movement, whose goal is to make sure all TLDs (whatever their length, date of creation, or characters they use) are treated equally.

They provide a document about "linkification", which has as a subproblem the extraction of links, and hence of domains, among other things. Have a look at their documentation: https://uasg.tech/wp-content/uploads/2017/06/UASG010-Quick-Guide-to-Linkification.pdf

This could give you some ideas, as could their Quick Guide at https://uasg.tech/wp-content/uploads/2016/06/UASG005-160302-en-quickguide-digital.pdf
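One thing the Universal Acceptance material stresses is internationalized TLDs. The IANA list stores them as A-labels (`xn--...`), so a hedged sketch of the extra conversion step, using the third-party `idna` package (IDNA 2008) and assuming a `tlds` set loaded as in the question, could look like this; the helper name is only for illustration:

```python
import idna  # third-party IDNA 2008 library: pip install idna

def tld_of(candidate, tlds):
    """Return the A-label TLD of a candidate name if it is a real TLD, else None."""
    last_label = candidate.rstrip(".").rsplit(".", 1)[-1]
    try:
        # The IANA list stores internationalized TLDs as A-labels (xn--...),
        # so convert the candidate's last label before looking it up.
        a_label = idna.encode(last_label, uts46=True).decode("ascii").lower()
    except idna.IDNAError:
        return None
    return a_label if a_label in tlds else None

# Example usage (assuming `tlds` was loaded from the IANA file as above):
# tld_of("пример.рф", tlds)   -> "xn--p1ai"
# tld_of("report.docx", tlds) -> None
```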

Patrick Mevzek