With the emergence of new TLDs (.club, .jobs, etc.), what is the current best practice for extracting/parsing domains from text? My typical approach is a regex; however, since things like file names with extensions trigger false positives, I need something more restrictive.
I've noticed that even Google sometimes fails to recognize whether I'm searching for a file name or want to go to a domain, so this appears to be a rather challenging problem. Machine learning could potentially be used to understand the context surrounding a string. However, unless there is a library that already does this, I won't bother getting too fancy.
One approach I'm considering is to run the regex first, then fetch http://data.iana.org/TLD/tlds-alpha-by-domain.txt, which holds a static list of current TLDs, and use it to filter the matches. Any suggestions?
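To make the idea concrete, here is a minimal sketch of the regex-then-filter approach in Python. The function names (`parse_tld_list`, `fetch_tlds`, `extract_domains`) and the candidate pattern are my own illustrative assumptions, not an existing library. It also assumes the IANA file is plain ASCII with one TLD per line and comment lines starting with `#`.

```python
import re
from urllib.request import urlopen

IANA_TLD_URL = "http://data.iana.org/TLD/tlds-alpha-by-domain.txt"

# Deliberately loose candidate pattern (an assumption, not a standard):
# one or more dot-terminated labels followed by a final label of 2-63 chars.
# The lookarounds keep the match from starting or ending mid-word.
CANDIDATE_RE = re.compile(
    r"(?<![\w.-])(?:[a-z0-9-]+\.)+[a-z0-9-]{2,63}(?![\w-])",
    re.IGNORECASE,
)


def parse_tld_list(raw):
    """Turn the text of the TLD file into a set of lowercase TLDs."""
    return {
        line.strip().lower()
        for line in raw.splitlines()
        if line.strip() and not line.startswith("#")
    }


def fetch_tlds(url=IANA_TLD_URL):
    """Download and parse the TLD list (requires network access)."""
    with urlopen(url, timeout=10) as resp:
        return parse_tld_list(resp.read().decode("ascii"))


def extract_domains(text, tlds):
    """Return regex candidates whose last label is a known TLD."""
    found = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = match.group(0).lower()
        if candidate.rsplit(".", 1)[-1] in tlds:
            found.append(candidate)
    return found


if __name__ == "__main__":
    # Small hard-coded sample so the demo runs offline;
    # swap in fetch_tlds() to use the live list.
    tlds = parse_tld_list("# sample header\nCLUB\nCOM\nJOBS\n")
    print(extract_domains("See example.club and notes.txt.", tlds))
```

This only reduces false positives rather than eliminating them: some file extensions (for example `.zip`, `.py`, `.sh`, `.pl`) are also valid TLDs, so names like `script.py` would still pass the filter. Caching the downloaded list locally instead of fetching it on every run would probably also be worthwhile.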