0

I need to put WHOIS data into a table with fields such as:

  • registrant,
  • created date,
  • expire date etc.

I have a script that extracts data from WHOIS servers, but the output format differs for each domain extension.

For example, for .com domains the registrant details come as one complete address, while for .org domains they come as separate fields: registrant name, street1, street2, street3, etc.

So I'm not able to extract the registrant details as a single unit to put in the database.

I heard somewhere that if we get the data as XML we would be able to extract it. Can somebody help me get around this? Thanks!

hakre
vkGunasekaran

3 Answers

5

Actually, the problem is a bit larger than that:

  • there is no unified request syntax
  • nor a defined set of capabilities
  • there is no defined schema for answers
  • local legislation makes the content differ
  • there is no standardized set of errors
  • the quality of the recorded information is poor
  • you must deal with internationalization

The WHOIS service is defined by RFC 3912. It is a very basic request protocol that does not define the format of the response at all. So the answers often reflect the format of the database that holds the data, and you may get a different syntax from each database. Since WHOIS can be used for whatever content you want, you cannot make many assumptions about the format of the answer you will get. You can, however, reasonably hope to receive parseable content, and similarly formatted answers for each request to a given server.

So you need to develop parsing logic for each server, which you will have to do in a very empirical manner.

However, here are a few tips for your development that come from the RFC.

  • you need to send the request over TCP to port 43, as a single line terminated by the ASCII characters CR+LF

  • you must treat the closing of the TCP connection, and only that, as the end of the answer.
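The two tips above can be sketched in Python as follows. This is only a minimal sketch, not a complete client: the function names (`build_query`, `exchange`, `whois_query`) are made up for illustration, and decoding the reply as UTF-8 with replacement is an assumption, since the RFC does not specify an encoding.

```python
import socket


def build_query(domain: str) -> bytes:
    """Return the request: a single line terminated by CR+LF."""
    return domain.encode("ascii") + b"\r\n"


def exchange(sock: socket.socket, domain: str) -> str:
    """Send one query on an open socket and read until the peer closes."""
    sock.sendall(build_query(domain))
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # empty read: the server closed the connection
            break
        chunks.append(data)
    # Assumption: no encoding is specified, so undecodable bytes are replaced.
    return b"".join(chunks).decode("utf-8", errors="replace")


def whois_query(server: str, domain: str, port: int = 43,
                timeout: float = 10.0) -> str:
    """Open a TCP connection to the WHOIS server and run one query."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        return exchange(sock, domain)
```

A call such as `whois_query("whois.verisign-grs.com", "example.com")` would then return the raw text for a .com name. Which server to ask for each TLD is something you have to look up yourself.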

About domain names specifically: the former restriction to ASCII led to the use of Punycode to encode some strings (accented ones, for example) in the DNS, so you should be prepared to find Punycode in WHOIS answers too. Internationalized Domain Names have existed since 2003, so you will also need to support Unicode. The algorithms for converting names are complex; RFC 3490 should give you some useful details about this.

Good luck!
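If you happen to work in Python, you may not need to write the conversion yourself: the built-in `idna` codec implements RFC 3490. A small sketch (the helper names `to_ascii` and `to_unicode` are mine, and `bücher.example` is just a sample name):

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("bücher.example"))           # xn--bcher-kva.example
print(to_unicode("xn--bcher-kva.example"))  # bücher.example
```

Normalizing every name to one of the two forms before you store or compare it avoids treating the same domain as two different records.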

Sylvain
  • Definitely. There is no standard format for WHOIS data. We must try to extract it for as many TLDs as possible, because new TLDs are introduced often... – vkGunasekaran Jun 28 '12 at 10:02
  • The crux of the problem, as already stated above, is that there are now hundreds of TLDs to choose from, and counting, meaning every person making their own solution must write over one thousand WHOIS record parsers. This just isn't feasible if you want to provide a quality experience to your users. If you are set on rolling your own, please stay away from regex; it ended horribly for me. I moved to [jsonwhoisapi.com](https://jsonwhoisapi.com) for a hosted solution, since it's so cheap. Got to spend money to make money, I guess! – sousdev Aug 19 '16 at 17:47
1

You need to detect the format and use a different regular expression for each one. Alternatively, as you mentioned, you can use XML or even JSON APIs: http://whoisxmlapi.com/ http://www.domaintools.com/api/docs/
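As a rough illustration of the regex approach, here is a Python sketch that handles the two shapes described in the question. Both layouts are assumptions made up for the example (an indented block under `Registrant:` and separate `Registrant Name` / `Registrant StreetN` fields); real servers will differ, so treat the patterns as placeholders.

```python
import re

# Assumed layout 1: "Registrant:" followed by an indented address block.
BLOCK_RE = re.compile(r"^Registrant:[ \t]*\r?\n((?:[ \t]+\S.*\n?)+)", re.MULTILINE)
# Assumed layout 2: one "Registrant <Field>: value" pair per line.
FIELD_RE = re.compile(r"^Registrant (Name|Street\d*):[ \t]*(.*)$", re.MULTILINE)


def parse_registrant(text: str) -> dict:
    """Detect which layout is present and return name and address."""
    block = BLOCK_RE.search(text)
    if block:
        lines = [ln.strip() for ln in block.group(1).splitlines() if ln.strip()]
        return {"name": lines[0], "address": ", ".join(lines[1:])}
    fields = FIELD_RE.findall(text)
    if fields:
        name = next((v for k, v in fields if k == "Name"), "")
        streets = [v.strip() for k, v in fields
                   if k.startswith("Street") and v.strip()]
        return {"name": name.strip(), "address": ", ".join(streets)}
    raise ValueError("unrecognized WHOIS format")
```

Either layout ends up as the same `{"name": ..., "address": ...}` dictionary, which is the single unit that can go into the database.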

Neel Basu
0

You need to extend your database and processing to deal with the problem better.

The data provided by the remote service comes in different formats, as you've already noted. So you need to separate the concerns of fetching the data and parsing it, because the two are independent of each other. For example, the format for one TLD can change over time.

So first of all you fetch the plain-text data per domain and store it with its metadata:

  • domain
  • whois server
  • timestamp of fetch operation
  • response
  • status code (if the protocol has this)
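A minimal sketch of such a raw store, using Python's `sqlite3` module. The table and column names are only an illustration; the status column is left out here on the assumption that plain WHOIS replies carry no status code.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per fetch, response kept exactly as received.
SCHEMA = """
CREATE TABLE IF NOT EXISTS whois_raw (
    id         INTEGER PRIMARY KEY,
    domain     TEXT NOT NULL,
    server     TEXT NOT NULL,
    fetched_at TEXT NOT NULL,
    response   TEXT NOT NULL
)
"""


def store_raw(conn: sqlite3.Connection, domain: str, server: str,
              response: str) -> int:
    """Insert one unparsed response and return its row id."""
    fetched_at = datetime.now(timezone.utc).isoformat()  # UTC, ISO 8601
    cur = conn.execute(
        "INSERT INTO whois_raw (domain, server, fetched_at, response) "
        "VALUES (?, ?, ?, ?)",
        (domain, server, fetched_at, response),
    )
    conn.commit()
    return cur.lastrowid
```

Because the response is stored untouched, you can re-run an improved parser over old rows without fetching anything again.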

You can then do the parsing later, in a second processing step. You can use the metadata that is already stored to decide which parsing algorithm you need. That also helps you maintain your application over time.

Once parsing succeeds, you have the normalized format you are aiming for.
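One way to picture the second step is a registry that maps the stored WHOIS server name to a parser, with every parser returning the same normalized keys. This is only a sketch: the server names and both reply formats are invented for the example, and the normalized keys (`registrant`, `created`, `expires`) simply follow the fields from the question.

```python
from typing import Callable, Dict

Parser = Callable[[str], Dict[str, str]]
PARSERS: Dict[str, Parser] = {}  # WHOIS server name -> parsing function


def register(server: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for one WHOIS server."""
    def decorator(func: Parser) -> Parser:
        PARSERS[server] = func
        return func
    return decorator


@register("whois.example-a.test")  # hypothetical server
def parse_colon_pairs(response: str) -> Dict[str, str]:
    """Assumed format: one 'Key: value' pair per line."""
    pairs = (ln.split(":", 1) for ln in response.splitlines() if ":" in ln)
    raw = {key.strip().lower(): value.strip() for key, value in pairs}
    return {"registrant": raw.get("registrant name", ""),
            "created": raw.get("creation date", ""),
            "expires": raw.get("expiry date", "")}


@register("whois.example-b.test")  # hypothetical server
def parse_bracket_pairs(response: str) -> Dict[str, str]:
    """Assumed format: one '[Key] value' pair per line."""
    raw = {}
    for line in response.splitlines():
        if line.startswith("[") and "]" in line:
            key, value = line[1:].split("]", 1)
            raw[key.strip().lower()] = value.strip()
    return {"registrant": raw.get("owner", ""),
            "created": raw.get("created on", ""),
            "expires": raw.get("expires on", "")}


def normalize(server: str, response: str) -> Dict[str, str]:
    """Pick the parser from the stored metadata and run it."""
    try:
        parser = PARSERS[server]
    except KeyError:
        raise LookupError(f"no parser registered for {server}") from None
    return parser(response)
```

When a server changes its format, only the parser registered for that server has to change, and records that have no parser yet fail loudly instead of being stored half-parsed.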

Besides the technical processing, you should respect the usage conditions of the WHOIS service(s). Not everything that is technically possible is legally or morally acceptable. Take care and treat other people's personal records with the respect they deserve. Protect the data you collect, e.g. archive and scramble or lock away data you no longer need for your ongoing processing.

hakre