
I'm looking for a way to find "all" the sites ending with a given TLD. I had several ideas on how to do this, but I'm not sure what the best/most effective approach is. I'm aware that pages that aren't linked from anywhere can't be found by spiders etc., so for this example I won't care about isolated pages. What I want is to take a TLD as input to my program and get a list of sites as output. For example:

# <program> .de
- spiegel.de
- deutsche-bank.de
...
- bild.de

So what is the best way to achieve this? Are there tools available to help me, or how would you program this?

user1620678
  • Sure? A DNS zone transfer could give you the list, but if and only if you are authorized to do an AXFR: http://en.wikipedia.org/wiki/DNS_zone_transfer – rene Aug 23 '12 at 19:26
  • Hello Rene, thanks for your answer. I did some research on your post and I'm able to perform such AXFR queries for one domain, but now I'm unsure how I would do it for an entire TLD. I used dig for my tests. Are there better tools? – user1620678 Aug 25 '12 at 16:39
  • AFAIK the DNS servers in the wild don't allow AXFR commands from non-authoritative servers, which is what you and I probably have. If such a tool exists, dig should be up to the task. – rene Aug 25 '12 at 18:54
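
For context, here is a minimal sketch of what an authorized zone transfer could look like in Python using the third-party dnspython library. The nameserver and zone names below are placeholders, and virtually all public TLD servers will refuse the request unless your host is explicitly authorized:

# Minimal AXFR sketch using dnspython (pip install dnspython).
# ns.example-registry.de and the "de" zone are placeholders; a real transfer
# only succeeds against a server that explicitly authorizes your host.
import dns.query
import dns.zone

def list_zone(nameserver, zone_name):
    # Request a full zone transfer (AXFR) and build a zone object from it.
    zone = dns.zone.from_xfr(dns.query.xfr(nameserver, zone_name))
    # Print every name in the zone, qualified with the zone origin.
    for name in sorted(zone.nodes.keys()):
        print(name.concatenate(zone.origin))

if __name__ == "__main__":
    list_zone("ns.example-registry.de", "de")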

2 Answers


This answer might be a bit late, but I've just found this.

You could try using Common Crawl's awesome data.

So, what is Common Crawl?

Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

Using their URL search tool, query for .de and then download the result as a JSON file.

You will get a nice file of results, but you will then need to do some work on it, since it includes the whole site map of each domain (hence crawling).
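
As a rough illustration, here is a small sketch of how you might query the Common Crawl URL index programmatically and reduce the records to unique hostnames. The index name CC-MAIN-2024-10 is only an example (it changes with every crawl), and the query parameters are an assumption based on the public CDX-style index API, so check https://index.commoncrawl.org/ for the current details:

# Sketch: query a Common Crawl CDX index for *.de and collect unique hosts.
# The index name and the query parameters are assumptions; verify them against
# the index server documentation before relying on this.
import json
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def unique_de_hosts(page=0):
    query = urlencode({"url": "*.de", "output": "json", "page": page})
    hosts = set()
    with urlopen(INDEX + "?" + query) as response:
        # The index returns one JSON record per line; keep only the hostname.
        for line in response:
            record = json.loads(line)
            hosts.add(urlparse(record["url"]).netloc)
    return hosts

for host in sorted(unique_de_hosts()):
    print(host)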

Another drawback is that some sites use restrictive robots.txt files, so crawlers won't include them. Still, it's the best result I could find so far.

Nimir

The code below is a multithreaded domain-checker script in Python 3. It uses something like a brute-force string generator: every possible combination of characters (up to the specified length) is appended to a list of candidate URLs. You may need to add some characters to the character set for your language; I successfully used it for Chinese, Russian, and Dutch sites.

from itertools import product
from multiprocessing.pool import ThreadPool
from timeit import default_timer as timer
from urllib.request import urlopen
import sys

chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'  # add all chars in your language

# Build every candidate URL up to the chosen label length.
# Careful: the number of combinations grows exponentially with the length.
urls = []
for length in range(1, 4):  # change this maximum length
    for attempt in product(chars, repeat=length):
        urls.append("https://" + ''.join(attempt) + ".de")

# Redirect stdout so every reachable domain is written to de.csv.
sys.stdout = open('de.csv', 'wt')

def fetch_url(url):
    # Return (url, body, None) on success or (url, None, error) on failure.
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()  # record when the sweep began
results = ThreadPool(4000).imap_unordered(fetch_url, urls)  # lower the thread count if your system struggles
for url, html, error in results:
    if error is None:
        print(url)