
Should I normalize domain names by splitting them into subdomain, domain, and TLD? I will be adding around 100 domains/subdomains per second and querying about 500 domains/subdomains per second.

My plan is to have one table for TLDs, populated from http://data.iana.org/TLD/tlds-alpha-by-domain.txt

I could then have another table for domain names and another one for subdomains.

The context is that I run an online site uptime service. I want to record daily uptime for as many domains as possible, checking around 100 per second and crawling the web to find more.

What would be the best structure to follow?

Brian Tompsett - 汤莱恩
Vish

2 Answers


I would use the full exact hostname (e.g. www.stackoverflow.com and stackoverflow.com are different). For some sites, two particular hostnames may be equivalent, but for others they won't be. I also don't see how tracking the TLDs will be useful (particularly after the upcoming TLD explosion).

I can see why you want to categorize it by domain, but bear in mind that two different pages (http://example.com/store and http://example.com/wiki) could be set up completely differently (e.g. different programming languages and databases), so one could easily be down while the other is running fine. Users will want this information on a per-URL basis.
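As a rough illustration of keying on the exact hostname while still recording results per URL, here is a minimal Python sketch. The helper name `host_key`, the `checks` dictionary, and the choice to strip a trailing dot are assumptions made for the example, not part of any existing system.

```python
from urllib.parse import urlsplit


def host_key(url: str) -> str:
    """Return the exact hostname of a URL (hypothetical helper).

    urlsplit() already lowercases the hostname and drops the port.
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in {url!r}")
    # Assumption: treat "example.com." and "example.com" as the same host.
    return host.rstrip(".")


# Uptime results stored per URL, grouped under the exact hostname.
checks = {}
for url in [
    "http://example.com/store",
    "http://example.com/wiki",
    "http://www.example.com/",
]:
    # None stands for "not checked yet" in this sketch.
    checks.setdefault(host_key(url), {})[url] = None
```

Here `www.example.com` and `example.com` end up as separate keys, and the two `example.com` pages each keep their own entry.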

Matthew Flaschen
  • Thanks for that. But how would I go about storing URLs if, let's say, I have about 200 million URLs in the future? Right now my servers have crawled and are tracking about 5 million URLs in one table, but lookups and inserts are slowing down, and the database has crashed 25 times in the last 2 weeks, for at least 1 hour each time. – Vish Jun 20 '12 at 18:57
  • You may want to use a non-relational (NoSQL) database. – Matthew Flaschen Jun 20 '12 at 18:59

If you store only the full host name, it seems like it would be difficult to run efficient queries for e.g. *.stackoverflow.com. A pattern with a leading wildcard generally can't take advantage of an ordinary index on the field, so such a query ends up scanning every row. On the other hand, storing the full string is easier, and the less efficient queries might not be a problem for a very long time.
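One possible middle ground, which is not something proposed in the question, is to keep the full host name and add a second column holding the labels in reverse order, so that "all subdomains of X" becomes a prefix range that an ordinary index can serve. The sketch below uses Python's `sqlite3` purely for illustration; the table name `hosts`, the column `rev_host`, and both helper functions are hypothetical.

```python
import sqlite3


def reverse_labels(host: str) -> str:
    """'www.stackoverflow.com' -> 'com.stackoverflow.www'."""
    return ".".join(reversed(host.lower().split(".")))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (host TEXT PRIMARY KEY, rev_host TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_rev_host ON hosts (rev_host)")

for h in [
    "stackoverflow.com",
    "www.stackoverflow.com",
    "meta.stackoverflow.com",
    "stackoverflowing.com",
    "www.example.org",
]:
    conn.execute("INSERT INTO hosts VALUES (?, ?)", (h, reverse_labels(h)))


def subdomains_of(conn: sqlite3.Connection, domain: str) -> list:
    """Return strict subdomains of `domain` using a range on rev_host."""
    prefix = reverse_labels(domain) + "."
    # Smallest string greater than every string starting with `prefix`:
    # bump the last character ('.' becomes '/').
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    rows = conn.execute(
        "SELECT host FROM hosts "
        "WHERE rev_host >= ? AND rev_host < ? ORDER BY rev_host",
        (prefix, upper),
    )
    return [row[0] for row in rows]
```

Calling `subdomains_of(conn, "stackoverflow.com")` returns the `meta` and `www` hosts, but not the bare domain or `stackoverflowing.com`. The cost is an extra column and index to maintain on every insert, which matters at the insert rates mentioned in the question.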

bmm6o