
I initially wrote this in Python, but I am completely open to writing it in any language.

I'm working with the Alexa top 1 million right now, and I want to find out how many of these domains have DMARC enabled. I am using dnspython to fetch the TXT record on the `_dmarc` subdomain of each domain. This is version 1; when I have time, I will eventually parse the results and load them into either a DynamoDB table or Postgres (TBD). Time is a big factor because this data set will grow in the future.
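For context, a single lookup with dnspython 2.x looks roughly like the minimal sketch below (the utils.dns wrapper in the code further down is assumed to do something equivalent internally; the 1.1.1.1 resolver and 1-second timeout match what I describe in the comments):

import dns.exception
import dns.resolver

resolver = dns.resolver.Resolver()
resolver.nameservers = ["1.1.1.1"]  # Cloudflare, as used in the full run
resolver.timeout = 1   # per-server timeout in seconds
resolver.lifetime = 1  # total time allowed for the query

try:
    for record in resolver.resolve("_dmarc.example.com", "TXT"):
        print(record.to_text())
except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
    print("no DMARC record published")
except dns.exception.Timeout:
    print("query timed out")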

The current code, the fastest version I have managed so far, is below. It takes approximately 26 hours to run at about 440 queries per minute. I want to bring that runtime down as much as I can.

I considered breaking this up into batches and starting them all at the same time, but to seriously cut down on time I would need more than 100 batches, and the number of batches would only grow as the data set expands. (A batched variant is sketched after the code below.)

import csv
import tqdm
from utils.dns import DNS
from concurrent import futures

with open('../data/top-1m.csv', 'r') as f:
    reader = csv.reader(f)
    domains = [domain for _, domain in reader]

def runDNS(domain):
    # TXT lookup on the _dmarc subdomain
    return DNS().query(f"_dmarc.{domain}", "TXT")

# generate a tqdm progress bar for domains
with tqdm.tqdm(total=len(domains), desc="Check DMARC") as progress:
    with futures.ProcessPoolExecutor(max_workers=600) as executor:
        # map runDNS over all domains; results come back in submission order
        for result in executor.map(runDNS, domains):
            progress.update(1)
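Since the lookups are network-bound rather than CPU-bound, an asyncio approach may be worth comparing against the process pool. The following is an untested sketch using dnspython's dns.asyncresolver; the batch size of 10,000 and the cap of 1,000 in-flight queries are assumptions to tune against socket limits, not measured values:

import asyncio
import dns.asyncresolver
import dns.exception

async def check_dmarc(domain, resolver, sem):
    # the semaphore caps the number of in-flight queries
    async with sem:
        try:
            answer = await resolver.resolve(f"_dmarc.{domain}", "TXT", lifetime=1)
            return domain, answer
        except dns.exception.DNSException:
            return domain, None  # NXDOMAIN, no answer, timeout, etc.

async def main(domains, batch_size=10_000):
    resolver = dns.asyncresolver.Resolver()
    resolver.nameservers = ["1.1.1.1"]
    sem = asyncio.Semaphore(1_000)
    results = []
    # gather one batch at a time so a million coroutines never exist at once
    for i in range(0, len(domains), batch_size):
        batch = domains[i:i + batch_size]
        results.extend(await asyncio.gather(
            *(check_dmarc(d, resolver, sem) for d in batch)))
    return results

results = asyncio.run(main(domains))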
  • Are you CPU bound, network bound, or being throttled by your DNS server? – tadman Jan 23 '23 at 03:36
  • @tadman - From what I can see, I am not hitting any limits. I checked the logs from the DNS utils function, and I am not seeing any network-related timeouts or DNS rate limiting. I am using 1.1.1.1 as the DNS resolver, with a 1-second timeout on each query. – PrettyNewbieAtThis Jan 23 '23 at 03:41
  • You'll need to find out what your bottleneck is before trying to optimize it. I've done similar things in Rust, [Tokio is fantastic](https://tokio.rs), but did hit limits with sockets first, as TCP DNS resolution can be very resource intensive. DMARC in particular can't always fit in a UDP DNS packet, it gets truncated, which can slow down queries significantly. You should also check that your NAT/firewall can handle the *extraordinary* number of connections this kind of activity can generate. – tadman Jan 23 '23 at 03:44
  • @tadman - Any recommendations on how to best investigate where the bottleneck exists? After digging into the logs I am seeing some DNS timeouts, but they are just hitting the 1-second timeout, which is fine. – PrettyNewbieAtThis Jan 23 '23 at 03:50
  • A) Keep an eye on your TCP socket availability. In my testing I quickly "ran out" of sockets when doing mass queries. B) Run your own DNS server so you can tweak resolution rates. I used the bind9 container on Docker, it's easy to configure from there. C) You may need to split this across multiple VMs to avoid socket limits, if that's impacting you. – tadman Jan 23 '23 at 04:31
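
Regarding point A in the comment above, a hypothetical way to watch socket usage from inside the script (psutil is an assumption here, not something used in the thread):

import psutil

def open_socket_count():
    # number of network connections currently held by this process
    return len(psutil.Process().connections())

Logging this alongside the query rate would show whether throughput stalls as the count approaches the OS socket limit.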

0 Answers