
I'm successfully using pythonwhois (installed with pip install ...) to check the availability of .com domains:

import pythonwhois
for domain in ['aaa.com', 'bbb.com', ...]:
    details = pythonwhois.get_whois(domain)
    if 'No match for' in str(details):   # simple but it works!
        print(domain)

But:

  • it is a little bit slow (about 2 requests per second on average)
  • won't I be blacklisted by the whois server if I do 26*26*26 = 17,576 requests?
    (I'm testing the availability of ???mail.com with each ? being a letter from a to z; a sketch of generating these candidates is below)
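
A minimal sketch of one way to generate those 26*26*26 = 17,576 candidate domains with itertools.product (the variable names here are just for illustration):

import itertools
import string

# all three-letter combinations of a..z, i.e. 26*26*26 = 17,576 candidates
candidates = ['{}{}{}mail.com'.format(a, b, c)
              for a, b, c in itertools.product(string.ascii_lowercase, repeat=3)]

print(len(candidates))    # 17576
print(candidates[:3])     # ['aaamail.com', 'aabmail.com', 'aacmail.com']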

Question: is there a better way to check availability than doing one whois request per domain?


Edit: The job finished in 9572 seconds, and here is the full list of all available domains of the form ???mail.com, as of November 2017, if anyone is interested in starting an email service!

1 Answer


You should parallelize what you are doing. Since most of your function's time is spent waiting on the whois server, you can run many checks at once (you are not limited to your number of processors). Example:

import pythonwhois
from joblib import Parallel, delayed

n_jobs = 100  # number of parallel workers

def f(domain):
    # returns the domain if the whois answer suggests it is free, None otherwise
    details = pythonwhois.get_whois(domain)
    if 'No match for' in str(details):   # simple but it works!
        print(domain)
        return domain
    else:
        return None

domains = ['aaa.com', 'bbb.com', 'ccc.com', 'bbbaohecraoea.com']
result = Parallel(n_jobs=n_jobs, verbose=10)(delayed(f)(domain) for domain in domains)

# create a list with the available domains
available_domains = [domains[idx] for idx, r in enumerate(result) if r is not None]
print(available_domains)
# Result
# ['bbbaohecraoea.com']
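
On Windows (see the error quoted in the comments below), joblib's spawned worker processes re-import the script, so the parallel call has to be protected by an if __name__ == '__main__': guard. A minimal sketch of the same idea with that guard:

import pythonwhois
from joblib import Parallel, delayed

def f(domain):
    # returns the domain if the whois answer suggests it is free, else None
    details = pythonwhois.get_whois(domain)
    if 'No match for' in str(details):
        return domain
    return None

if __name__ == '__main__':
    # the guard keeps worker processes from re-running the parallel call on import
    domains = ['aaa.com', 'bbb.com', 'ccc.com', 'bbbaohecraoea.com']
    result = Parallel(n_jobs=100, verbose=10)(delayed(f)(d) for d in domains)
    print([r for r in result if r is not None])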
  • Wow, this looks very nice, I've never used `Parallel` before! (Just an error with your code: `ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information`) (I use Python 2.7 on Windows). – Basj Nov 21 '17 at 19:44
  • So I added `if __name__ == '__main__':`, and then it runs, but I have no output other than `[Parallel(n_jobs=10)]: Done 5 tasks | elapsed: 1.7s [Parallel(n_jobs=10)]: Done 12 tasks | elapsed: 2.1s [Parallel(n_jobs=10)]: Done 21 tasks | elapsed: 2.5s [Parallel(n_jobs=10)]: Done 30 tasks | elapsed: 2.6s`. Any idea how to actually display the results? I think the `print` doesn't have any effect. – Basj Nov 21 '17 at 19:46
  • I added a little more code to show you the available domains. Also, in my case, there is no error while importing; it's probably a version problem or something. – silgon Nov 21 '17 at 20:29
  • Oh I see, it's not displayed on the fly, but at the end, when printing the `result`... Is there a way to print the domains on the fly (while it's running) rather than at the end? It seems that the `print(domain)` in `f()` doesn't show. – Basj Nov 21 '17 at 20:33
  • Maybe it's because of the version you use; with my version under Linux it works. But what you're interested in is only the result. Just to play around I created [another script](https://pastebin.com/WwQ6U2PD) that creates 10000 random strings and checks them. – silgon Nov 21 '17 at 20:40
  • Ok, maybe the numbers were really ambitious. I changed to `n_jobs=100` and it worked fine for me. – silgon Nov 21 '17 at 20:53
  • Yes, 100 is far better; 1000 produced a massive freeze / forkbomb (hundreds of python32.exe processes in the task manager!) – Basj Nov 21 '17 at 21:09
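
Since the comments above touch on the forking error on Windows, the missing on-the-fly output, and the process explosion at n_jobs=1000, a thread pool is another option for this I/O-bound job. Here is a minimal sketch (not from the original answer) using concurrent.futures.ThreadPoolExecutor, standard library in Python 3 and available on Python 2.7 via the futures backport; threads share the parent's stdout, so each print appears as soon as that check finishes:

import pythonwhois
from concurrent.futures import ThreadPoolExecutor

def check(domain):
    # same availability test as above, run in a worker thread
    details = pythonwhois.get_whois(domain)
    if 'No match for' in str(details):
        print(domain)        # printed as soon as this check comes back
        return domain
    return None

domains = ['aaa.com', 'bbb.com', 'ccc.com', 'bbbaohecraoea.com']
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(check, domains))

print([r for r in results if r is not None])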