I have a database of 20000 domain names, including Top Level Domains, Second-level and lower level domains. For example
.biz stackoverflow.com ru.wikipedia.com
I would like to perform fast lookups to see if input URLs match any of these 20000. I could use Dictionary key or HashSet.Contains, but it is good for exact matches only. As the database contains also TLD names I would like acmeycompany.biz also return as match because of .biz TLD. On the other hand, fr.wikipedia.com should not match, because sub-domain is different.
Simply looping over the list and doing string based comparison is also not an option. If I have 1000 URLs to compare it's simply too slow. So it must be key based index lookups.
I was thinking of building a tree structure like below and then do key based look-ups, for example:
.com .wikipedia .ru .stackoverflow .biz
Then I can split input Url (sampledomain.com) to parts and do lookups like this .com -> .sampledomain
Could anybody point me to a sample how to do it? Or what are other alternatives? Any samples are appreciated.
Thanks!
This is how I started... It is vb.net code but you get the idea.
Public Class TreeNode
Sub New()
ChildNodes = New Dictionary(Of String, TreeNode)
End Sub
Public Property Key As String
Public Property ChildNodes As Dictionary(Of String, TreeNode)
End Class
Private Tree As New Dictionary(Of String, TreeNode)
Sub BuildTree()
For Each Link In Links
If Uri.IsWellFormedUriString(Link, UriKind.Absolute) Then
Dim Url As New Uri(Link)
Dim Domain As String
If Url.HostNameType = UriHostNameType.Dns Then
Domain = Url.Host.ToLower.Replace("www.", "")
Dim DomainParts() As String = Domain.Split(CChar("."))
'lets start from TLD
For Each Key In DomainParts.Reverse
'dont konw how to populate tree
Next
End If
End If
Next
End Sub
Function TreeLookup(Link As String) As Boolean
Dim Url As New Uri(Link)
Dim Domain As String
Dim IsMatch As Boolean = False
If Url.HostNameType = UriHostNameType.Dns Then
Domain = Url.Host.ToLower.Replace("www.", "")
Dim DomainParts() As String = Domain.Split(CChar("."))
Dim DomainPartsCount As Integer = DomainParts.Length
Dim Index As Integer = 0
For Each Key In DomainParts
Index += 1
'use recursive loop to see if
'returns true if directory contains key and we have reached to the last part of the domain name
If Index = DomainPartsCount Then
IsMatch = True
Exit For
End If
Next
End If
Return IsMatch
End Function