Extracting rootdomains from URL string in Google Sheets

Question

Hi I am trying to extract the rootdomain from URL string in Google Sheets. I know how to get the domain and I have the formula to remove www. but now I realize it does not strip subdomain prefixes like 'mysite'.site.com; where mysite is not stripped from the domain name.

Question: How can I retrieve the domain.com rootdomain where the domain string contacts alphanumeric characters, then 1 dot, then alphanumeric characters (and nothing more)

Formula so far in Google Sheets:

=REGEXREPLACE(REGEXREPLACE(D3923;"(http(s)?://)?(www\.)?";"");"/.*";"")

Maybe this can be simplified ...

Test cases

    https://www.domain.com/ => domain.com
    https://domain.com/ => domain.com
    http://www.domain.nl/ => domain.com
    http://domain.de/ => domain.com
    http://www.domain.co.uk/ => domain.co.uk
    http://domain.co.au/ => domain.co.au
    sub.domain.org/ => sub.domain.com
    sub.domain.org => sub.domain.com
    domain.com => domain.com
    http://www.domain.nl?par=1  => domain.com
    https://www.domain.nl/test/?par=1  => domain.com
    http2://sub2.startpagina.nl/test/?par=1  => domain.com

snh_nl · Accepted Answer · 2016-07-07T08:11:16.587

46

Currently using:

=trim(REGEXEXTRACT(REGEXREPLACE(REGEXREPLACE(A2;"https?://";"");"^(w{3}\.)?";"")&"/";"([^/?]+)"))

Seems to work fine

Updated:7-7-2016

(thanks for all the help!)

edited Jul 07 '16 at 08:11

answered Aug 23 '15 at 13:29

snh_nl

2,877
6
32
62

5

To be honest, I have no idea if this would be any more efficient, but I guess it only uses one function: `=REGEXEXTRACT("."&B1&"/","\.([^.]+\.[^./?]+)(?:/|\?)")` – AdamL Aug 24 '15 at 01:06
1

Cool. Works great. And yes only 1 function. +1 – snh_nl Aug 24 '15 at 07:05
1

Please add is as an answer. I will approve. This will help many. – snh_nl Aug 24 '15 at 07:15
1

@AdamL It seems to be working fine. Only thinkg missing is when there is a domain sub.domain.com where www.domain.com should become => domain.com and sub.domain.com => sub.domain.com .... I was thinking of adding ([^w]\.)? but this does not work for all the test cases ... help appreciated – snh_nl Sep 17 '15 at 11:26
1

There's an issue with http / https filtering. Try on `https://moneytis.com/` and `http://sumome.com/` – Denis Gorbachev Apr 22 '16 at 14:20
Thanks, this worked well! How would I add it across thousands of rows (ie apply to an entire column that is very long)? – Matt Feb 27 '17 at 20:21
Paste it in the column next to it. – snh_nl Feb 27 '17 at 20:34
@snh_nl Any idea how you can provide a default value in the trimmed cell if the url string cell is empty? Thanks! – Ethan Jul 20 '18 at 15:49
so many bugs in this implementation , the second solution is more reliable – Frederic Bazin May 22 '20 at 11:41
@Ethan add if statement – snh_nl May 22 '20 at 14:09

Igor Mikushkin · Answer 2 · 2021-06-22T14:31:53.023

I think that a most reliable way is to check over TLD list because of TLDs like co.uk, gov.uk and so on that are impossible to extract via a simple regex.

You can define these functions in Tools -> Script editor

function endsWith(str, searchString) {
    position = str.length - searchString.length;
    var lastIndex = str.lastIndexOf(searchString);
    return lastIndex !== -1 && lastIndex === position;
}

function rawToTlds(raw) {
    var letter = new RegExp(/^\w/);
    return raw.split(/\n/).filter(function (t) { return letter.test(t) })
}

function compressString(s) {
    var zippedBlob = Utilities.gzip(Utilities.newBlob(s))
    return Utilities.base64Encode(zippedBlob.getBytes())
}

function uncompressString(x) {
    var zippedBytes = Utilities.base64Decode(x)
    var zippedBlob = Utilities.newBlob(zippedBytes, 'application/x-gzip')
    var stringBlob = Utilities.ungzip(zippedBlob)
    return stringBlob.getDataAsString()
}

function getTlds() {
    var cacheName = 'TLDs'
    var cache = CacheService.getScriptCache();
    var base64Encoded = cache.get(cacheName);
    if (base64Encoded != null) {
        return uncompressString(base64Encoded).split(',')
    }
    var raw = UrlFetchApp.fetch('https://publicsuffix.org/list/public_suffix_list.dat').getContentText()
    var tlds = rawToTlds(raw)
    cache.put(cacheName, compressString(tlds.join()), 21600)
    return tlds
}

function getDomainName(url, level) {
    var tlds = getTlds()

    var domain = url
        .replace(/^http(s)?:\/\//i, "")
        .replace(/^www\./i, "")
        .replace(/\/.*$/, "")
        .replace(/\?.*/, "");

    if (typeof level === 'undefined') {
        return domain
    }

    var result = domain
    var longest = 0
    for (i in tlds) {
        var tld = '.' + tlds[i]
        if (endsWith(domain, tld) && tld.length > longest) {
            var parts = domain.substring(0, domain.length - tld.length).split('.')
            result = parts.slice(parts.length-level+1, parts.length).join('.') + tld
            longest = tld.length
        }
    }

    return result
}

To get second-level domian of A1 use it like this

=getDomainName(A1, 2)

To get full domain of A1 just do

=getDomainName(A1)

EDIT

Public Suffix List has exceeded 100KB. It doesn't fit in Apps Script cache anymore. So I'm gzipping it now.

I think this answer is probably the better than the regex one, however there is an issue with this code because the max cache size is 100KB and the entire TLD list is over 8000 entries. Once I removed the cache (which was a bad idea but I just wanted to test) it didn't work correctly with blogspot subdomains. — Dagmar, Jun 15 '21 at 15:21
Ha! Interesting - following on from my last comment (I pasted the TLDs into a sheet) I see "blogspot.com" is considered a TLD — Dagmar, Jun 15 '21 at 16:16
@Dagmar The list is not exactly TLD list. This is the list of domains under which Internet users can register names. It became quite huge nowadays as well. So it is time to update an answer. I'll try to post an update in a day or two — Igor Mikushkin, Jun 16 '21 at 10:47
@Dagmar It looks like simple gzipping works well and quite fast in this case. Answer is updated — Igor Mikushkin, Jun 22 '21 at 14:27
I don't understand why this answer isn't voted higher. Regex can never compete with actual knowledge of TLDs. — Kit Johnson, Jan 21 '22 at 08:00

score 1 · Answer 3 · answered Jun 06 '21 at 16:58

1

try:

=INDEX(IFERROR(REGEXEXTRACT(A1:A, 
 "^(?:https?:\/\/)?(?:ftp:\/\/)?(?:www\.)?([^\/]+)")))

answered Jun 06 '21 at 16:58

player0

124,011
12
67
124

Extracting rootdomains from URL string in Google Sheets

3 Answers3

Linked

Related