I need to parse raw WHOIS records into fields. There is no single consistent format for the raw data, and I need to support all of the possible formats (there are about 40 unique formats that I know of). For example, here are excerpts from 3 different raw WHOIS records:
Created on: 2007-01-04
Updated on: 2014-01-29
Expires on: 2015-01-04
Registrant Name: 0,75 DI VALENTINO ROSSI
Contact: 0,75 Di Valentino Rossi
Registrant Address: Via Garibaldi 22
Registrant City: Pradalunga
Registrant Postal Code: 24020
Registrant Country: IT
Administrative Contact Organization: Giorgio Valoti
Administrative Contact Name: Giorgio Valoti
Administrative Contact Address: Via S. Lucia 2
Administrative Contact City: Pradalunga
Administrative Contact Postal Code: 24020
Administrative Contact Country: IT
Administrative Contact Email: giorgio_v@mac.com
Administrative Contact Tel: +39 340 4050596
---------------------------------------------------------------
Registrant :
onse telecom corporation
Gangdong-gu Sangil-dong, Seoul
Administrative Contact :
onse telecom corporation ruhisashi@onsetel.co.kr
Gangdong-gu Sangil-dong, Seoul,
07079976571
Record created on 19-Jul-2004 EDT.
Record expires on 19-Jul-2015 EDT.
Record last updated on 15-Jul-2014 EDT.
---------------------------------------------------------------
Registrant:
Name: markaviva comunica??o Ltda
Organization: markaviva comunica??o Ltda
E-mail: helissonmaia@markaviva.com.br
Address: RUA FERNANDES LIMA 360 sala 03
Address: 57300070
Address: ARAPIRACA - AL
Phone: 55 11 40039011
Country: BRASIL
Created: 20130405
Updated: 20130405
Administrative Contact:
Name: markaviva comunica??o Ltda
Organization: markaviva comunica??o Ltda
E-mail: helissonmaia@markaviva.com.br
Address: RUA FERNANDES LIMA 360 sala 03
Address: 57300070
Address: ARAPIRACA - AL
Phone: 55 11 40039011
Country: BRASIL
Created: 20130405
Updated: 20130405
As you can see, there's no repeating pattern across the formats. I need to extract fields such as 'Registrant Name', 'Registrant Address', 'Admin Name', 'Admin City', etc.
I first tried a basic method of field extraction based on splitting each line on the first colon, but it only works when the line prefixes are distinct (no two lines share the same prefix) and are actually separated from the value by a colon, which is not always the case.
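A minimal sketch of that colon-splitting approach, to show where it breaks down. The function name and the sample strings are illustrative only; the strings are shortened from the excerpts above.

```python
def split_on_first_colon(raw):
    """Naive extraction: map each line's prefix (text before the first colon) to its value."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # Lines without a colon (e.g. "Record created on 19-Jul-2004 EDT.") are lost.
            continue
        # A repeated prefix overwrites the earlier value.
        fields[key.strip()] = value.strip()
    return fields


# Works when every prefix is unique and colon-separated (first format).
flat = split_on_first_colon(
    "Registrant Name: 0,75 DI VALENTINO ROSSI\nRegistrant City: Pradalunga"
)

# Fails on the sectioned format: both "Name" lines collapse into one key.
sectioned = split_on_first_colon(
    "Registrant:\nName: A\nAdministrative Contact:\nName: B"
)
print(flat)
print(sectioned)
```

In the sectioned case the registrant's name is overwritten by the administrative contact's, and the section headers come back as keys with empty values.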
Now, I could go over the formats one by one and write a regex for each of them, but that would take a lot of time, which I don't have. I wonder whether there's a way to automatically detect blocks of text and treat each one as a context-based "chunk" (based on spacing and on common repeating words such as 'registrant' or 'admin') and then analyze them accordingly. NLP, maybe?
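To make the "chunk" idea concrete, here is a rough sketch of the kind of grouping I have in mind. It is a plain keyword heuristic, not NLP, and the keyword pattern `SECTION_RE` and the function name `chunk_by_section` are assumptions made up for illustration. It only handles records where a section header sits on its own line.

```python
import re

# Hypothetical header pattern: a line consisting only of a known section
# keyword, optionally followed by a colon. Real records would need more keywords.
SECTION_RE = re.compile(
    r"^(registrant|admin(?:istrative)?(?: contact)?)\s*:?\s*$",
    re.IGNORECASE,
)


def chunk_by_section(raw):
    """Group the lines that follow each section header under that header."""
    chunks = {}
    current = None
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = SECTION_RE.match(stripped)
        if match:
            # Normalize the header, e.g. "Administrative Contact" -> "administrative contact".
            current = " ".join(match.group(1).lower().split())
            chunks.setdefault(current, [])
        elif current is not None:
            # Everything until the next header belongs to the current section,
            # so trailing lines such as "Record created on ..." end up here too.
            chunks[current].append(stripped)
    return chunks


sample = (
    "Registrant :\n"
    "onse telecom corporation\n"
    "Gangdong-gu Sangil-dong, Seoul\n"
    "Administrative Contact :\n"
    "onse telecom corporation ruhisashi@onsetel.co.kr\n"
    "Gangdong-gu Sangil-dong, Seoul,\n"
    "07079976571\n"
)
chunks = chunk_by_section(sample)
print(chunks)
```

This still leaves the harder part open: deciding which line inside a chunk is the name, the address, or the phone number, and handling the flat format where the section name is part of every prefix.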
I'd be glad to hear any ideas, as I'm kind of stumped here. Thanks.