I was wondering how to extract all valid subdomains and domains from any file's content. Many sites can extract domains from arbitrary text online, but how can I do this in the terminal on a Linux machine?

Using grep I can do this with this regex: `(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]`

Example:

echo "extract example.com and a.example.cloud. from all content" | grep -oP "(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]"
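The same pattern works on a file instead of a pipe; piping through `sort -u` deduplicates the matches. A small sketch (the filename `input.txt` is just a scratch file for the example):

```shell
# Write some sample text to a scratch file, then extract and deduplicate
# candidate domains using the same PCRE pattern as above.
printf 'see example.com, mail.example.com and example.com again\n' > input.txt
grep -oP '(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]' input.txt | sort -u
```

This prints each distinct match once, sorted.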

Any other ways of doing this?
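One alternative sketch, in case `-P` (PCRE) is unavailable: a POSIX ERE version with `grep -Eo`. Note it is an approximation; it drops the 63-character label-length limit that the PCRE pattern enforces:

```shell
# Rough POSIX ERE equivalent of the PCRE pattern above (no -P required).
# Labels of any length are accepted here, so it is slightly more permissive.
echo "extract example.com and a.example.cloud. from all content" |
  grep -Eo '([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9-]*[a-z0-9])?'
```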

pancho
  • This is not a duplicate; the other regexes do not work as well. This one can extract from any file content. – pancho Nov 06 '18 at 07:36
  • `423.231.413.14` is not a valid IP address anyway. The regex you quote may match hostnames but not domain names, as those are defined in the RFC about DNS to be basically any sequence of bytes, with just some limits on total length and each label length. On top of that you may then have other rules like "no numerical TLD". So in short you need to provide far more context to your question to see why you need to extract domains. – Patrick Mevzek Nov 06 '18 at 14:59
  • Only subdomains and domains should be allowed. False positives are also allowed, for example `test.example.noexit` – pancho Nov 06 '18 at 19:44

0 Answers