How to handle WHOIS data

Question

I need to store WHOIS data in a table with columns such as:

registrant,
creation date,
expiry date, etc.

I have a script that extracts data from WHOIS servers, but the output format is different for each domain extension.

For example, for .com domains the registrant details come as one complete address, while for .org domains they come as separate fields: registrant name, street1, street2, street3, etc.

So I'm not able to extract the registrant details as a single unit to put in the database.

I heard somewhere that if we get the data as XML we would be able to extract it. Can somebody help me get around this? Thanks!

You should use different regular expressions to get the data you need. I don't think there's a 'one-solution-fits-all' process here. — Joshua - Pendo, May 06 '11 at 11:37
http://whoisxmlapi.com/ http://www.domaintools.com/api/docs/ — Neel Basu, May 06 '11 at 11:39

score 5 · Accepted Answer · answered Jun 28 '12 at 08:56

Actually, the problem is a bit larger than that.

There is no unified syntax for requests,
nor a defined set of capabilities.
There is no defined schema for answers.
Local legislation makes the contents differ.
There is no standardized set of errors.
The recorded information is often of poor quality.
You must deal with internationalization.

The WHOIS service is defined by RFC 3912. It is a very basic request protocol that does not define the format of the returned content at all. So the answers often reflect the format of the database that holds the data, and you may get a different syntax for each database. Since WHOIS can be used for whatever content the operator wants, you cannot make many assumptions about the format of the answer you will get. Fortunately, however, you can expect to receive parseable content, and similarly formatted answers for each request to a given server.

So you need to develop parsing logic for each server, which you will have to do in a very empirical manner.

However, here are a few tips for your development that come from the RFC.
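As a rough illustration of per-server parsing logic, here is a minimal Python sketch. The server names (`whois.a.example`, `whois.b.example`), the two sample responses and all field labels are made up for this example and do not reflect what any real registry returns. You would have to write each parser from the output you actually observe.

```python
# Hypothetical sample responses, invented for illustration only.
SAMPLE_A = """\
Domain Name: EXAMPLE.TEST
Creation Date: 2001-02-03
Registrant Name: Jane Doe
Registrant Street1: 1 Main St
Registrant Street2: Suite 5
"""

SAMPLE_B = """\
domain:   example.test
created:  2001-02-03
registrant:
    Jane Doe
    1 Main St, Suite 5
"""


def parse_key_value(text):
    """Collect 'Key: Value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def parse_server_a(text):
    """Format A (assumed): the address arrives as separate street fields."""
    f = parse_key_value(text)
    keys = ("registrant street1", "registrant street2", "registrant street3")
    street = [f[k] for k in keys if k in f]
    return {
        "registrant": f.get("registrant name"),
        "address": ", ".join(street),
        "created": f.get("creation date"),
    }


def parse_server_b(text):
    """Format B (assumed): the registrant is an indented free-text block."""
    f = parse_key_value(text)
    block, in_block = [], False
    for line in text.splitlines():
        if line.strip().lower() == "registrant:":
            in_block = True
        elif in_block and line.startswith((" ", "\t")):
            block.append(line.strip())
        else:
            in_block = False
    return {
        "registrant": block[0] if block else None,
        "address": ", ".join(block[1:]),
        "created": f.get("created"),
    }


# One parser per WHOIS server; both produce the same normalized record.
PARSERS = {
    "whois.a.example": parse_server_a,
    "whois.b.example": parse_server_b,
}


def parse(server, text):
    return PARSERS[server](text)
```

Both parsers return the same keys, so the database only ever sees one record shape, whatever the server sent.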

You need to send the request over TCP to port 43, as a single line terminated by the ASCII characters CR+LF.
The only signal that the answer is finished is the server closing the TCP connection.
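The two points above can be sketched in Python with the standard `socket` module. The function name `whois_query`, the timeout value, and decoding the reply as UTF-8 with replacement characters are assumptions of this sketch. RFC 3912 does not specify a character set, so the right decoding depends on the server.

```python
import socket


def whois_query(query, server, port=43, timeout=10.0):
    """Send one WHOIS query and return the raw reply as text.

    The query must be ASCII, so convert internationalized names first.
    """
    request = query.encode("ascii") + b"\r\n"  # single line ended by CR+LF
    chunks = []
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: answer is complete
                break
            chunks.append(data)
    # Assumption: decode leniently, because the protocol defines no encoding.
    return b"".join(chunks).decode("utf-8", errors="replace")
```

You would call it as `whois_query("example.com", server)`, with `server` set to the WHOIS host responsible for that TLD. Which host that is has to be looked up separately.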

About domain names specifically: because DNS was formerly restricted to ASCII, names containing non-ASCII characters (accented letters, for example) are encoded with Punycode, so you should expect to meet such encoded strings in some WHOIS replies as well. The existence of Internationalized Domain Names since 2003 means you will need to support Unicode. The algorithms that convert names are complex; RFC 3490 should give you some useful details about this.

Good luck!
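For the conversion itself you may not need to implement anything. As a small sketch, Python's built-in `idna` codec implements the RFC 3490 rules and converts between the Unicode and Punycode forms of a name. The helper names `to_ascii` and `to_unicode` are chosen for this example only.

```python
def to_ascii(name):
    """Unicode domain name -> ASCII (Punycode) form used in DNS and WHOIS."""
    return name.encode("idna").decode("ascii")


def to_unicode(name):
    """ASCII (Punycode) form -> Unicode domain name for display or storage."""
    return name.encode("ascii").decode("idna")


print(to_ascii("bücher.example"))           # xn--bcher-kva.example
print(to_unicode("xn--bcher-kva.example"))  # bücher.example
```

Storing both forms alongside each other makes it easier to match a name whichever way a WHOIS server chooses to print it.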

Definitely. There's no standard format for WHOIS data. We have to try to extract it for as many TLDs as possible, because new TLDs are introduced often... — vkGunasekaran, Jun 28 '12 at 10:02
The crux of the problem, as already stated above, is that there are now hundreds of TLDs and counting to choose from, meaning every person making their own solution must write over one thousand WHOIS record parsers. This just isn't feasible if you want to provide a quality experience to your users. If you are set on rolling your own, please stay away from regex; it ended horribly for me. I moved to [jsonwhoisapi.com](https://jsonwhoisapi.com) for a hosted solution, since it's so cheap. Got to spend money to make money, I guess! — sousdev, Aug 19 '16 at 17:47

score 1 · Answer 2 · answered May 06 '11 at 11:41

You need to detect the format and use different regular expressions for each one. Alternatively, as you mentioned, you can use XML or even JSON APIs: http://whoisxmlapi.com/ http://www.domaintools.com/api/docs/

Neel Basu


I'm expecting an open-source solution. The services above are not free. – vkGunasekaran Jun 15 '11 at 13:11
Are you expecting? You got your code running? Where is the source now? – hakre Jun 28 '12 at 10:33

score 0 · Answer 3 · edited Oct 07 '21 at 06:26

You need to extend your database and processing to better deal with the problem.

The data provided by the remote service comes in different formats, as you've already noted. So you need to separate the concerns of fetching the data and parsing it, because the two are independent of each other. For example, the format for one TLD can change over time.

So first of all you fetch the plain-text data per domain and store it together with its metadata:

domain
whois server
timestamp of fetch operation
response
status code (if the protocol has this)
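The fetch-and-store step could look like the following minimal sketch using SQLite. The table name `whois_raw`, the column names and the helper `store_raw` are assumptions made for this example. The status code column is left out because RFC 3912 defines no status codes for WHOIS.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: one row per fetch, keeping the untouched server reply.
SCHEMA = """
CREATE TABLE IF NOT EXISTS whois_raw (
    id         INTEGER PRIMARY KEY,
    domain     TEXT NOT NULL,
    server     TEXT NOT NULL,
    fetched_at TEXT NOT NULL,   -- ISO 8601 timestamp, UTC
    response   TEXT NOT NULL    -- raw reply, stored before any parsing
)
"""


def store_raw(conn, domain, server, response, fetched_at=None):
    """Insert one raw WHOIS reply and return the new row id."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    cur = conn.execute(
        "INSERT INTO whois_raw (domain, server, fetched_at, response) "
        "VALUES (?, ?, ?, ?)",
        (domain, server, fetched_at.isoformat(), response),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # use a file path for real storage
conn.execute(SCHEMA)
row_id = store_raw(conn, "example.test", "whois.example.test", "raw reply text")
```

Because the raw reply is kept, a parser that turns out to be wrong can be fixed and re-run later without querying the WHOIS server again.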

You can then do the parsing later, in a second processing step. You can use the metadata you have already stored to decide which parsing algorithm you need. That also helps you maintain your application over time.

Once parsing has succeeded, you have the normalized format you are aiming for.

Beyond these technical steps, you should respect the usage conditions of the WHOIS service(s). Not everything that is technically possible is legally or morally acceptable. Take care and treat other people's personal records with the respect they deserve. Protect the data you collect, e.g. archive and scramble or lock away data you no longer need for your ongoing processing.
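One way to use the stored metadata is to pick the parser from the server name and the fetch timestamp, so that a format change on one server only adds a new entry. This is a sketch under assumptions: the server name `whois.example.test`, the cut-over dates and the two parsers `parse_v1` and `parse_v2` are all hypothetical.

```python
from datetime import datetime, timezone


def first_value(text, label):
    """Return the value of the first 'label: value' line, or None."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == label:
            return value.strip()
    return None


def parse_v1(text):
    """Hypothetical older format that labels the date 'created'."""
    return {"created": first_value(text, "created")}


def parse_v2(text):
    """Hypothetical newer format that labels the date 'Creation Date'."""
    return {"created": first_value(text, "creation date")}


# Per server: (valid from, parser) pairs, sorted by date ascending.
REGISTRY = {
    "whois.example.test": [
        (datetime(2000, 1, 1, tzinfo=timezone.utc), parse_v1),
        (datetime(2015, 1, 1, tzinfo=timezone.utc), parse_v2),
    ],
}


def select_parser(server, fetched_at):
    """Pick the newest parser that was already valid at fetch time."""
    candidates = [p for since, p in REGISTRY.get(server, []) if since <= fetched_at]
    if not candidates:
        raise LookupError(f"no parser for {server} at {fetched_at.isoformat()}")
    return candidates[-1]


parser = select_parser("whois.example.test", datetime(2021, 10, 7, tzinfo=timezone.utc))
record = parser("Creation Date: 2001-02-03")
```

Old rows keep being parsed with the parser that matched their format, and an unknown server fails loudly instead of producing half-empty records.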

See also:

RFC 3912, WHOIS Protocol Specification
