I must be naïve but I'm very surprised to hear that some domains can have a million subdomains. In any case, I suspect that a significant majority of domains would have less than 100 subdomains so for the most part, your current table schema is going to be fine and you just need to deal with the really "large" domains.
This is a common problem for social apps and in Graph Theory it is known as the supernode problem -- a vertex with an incredibly high number of edges. In simpler terms, it's Barack Obama (the vertex or node) with over 133M followers (edges) on Twitter, or Cristiano Ronaldo with over 506M followers on Instagram.
For apps that run into the supernode problem, they typically work around it by handling the supernodes separately from the rest. In your case, you need to implement some logic in your app to detect the "super domains" and store them in a separate table.
A possible table design uses the first 2 characters of the subdomain as a bucket. For example with domain sub.domainsr.us
, we use the prefix su
for bucketing to make the partitions smaller:
CREATE TABLE subdomains_by_domain_tld_prefix (
domain text,
tld text,
prefix text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld, prefix), subdomain)
)
This is just an example so the prefix
doesn't have to be limited to just the first 2 characters. You can adjust it depending on the dataset.
Also if it makes it simpler for your app, you can choose to use this table for all domains. Cheers!
Please support the Apache Cassandra community by hovering over the cassandra
tag above and click on Watch tag. Thanks!