I have a list of 30 million strings, and I want to run a DNS query against all of them using Python. I do not understand how this operation can become memory intensive. I would assume that the threads exit after the job is done, and there is also a timeout of 1 second ({'dns_request_timeout': 1}).

Here is a sneak peek of the machine's resources while running the script:

[screenshot of the machine's resource usage while the script runs]

My code is as follows:

# -*- coding: utf-8 -*-
import dns.resolver
import concurrent.futures
from pprint import pprint
import json


bucket = json.load(open('30_million_strings.json','r'))


def _dns_query(target, **kwargs):
    global bucket
    resolv = dns.resolver.Resolver()
    resolv.timeout = kwargs['function']['dns_request_timeout']
    try:
        resolv.query(target + '.com', kwargs['function']['query_type'])
        with open('out.txt', 'a') as f:
            f.write(target + '\n')
    except Exception:
        pass


def run(**kwargs):
    global bucket
    temp_locals = locals()
    pprint({k: v for k, v in temp_locals.items()})

    with concurrent.futures.ThreadPoolExecutor(max_workers=kwargs['concurrency']['threads']) as executor:
        future_to_element = dict()

        for element in bucket:
            future = executor.submit(kwargs['function']['name'], element, **kwargs)
            future_to_element[future] = element

        for future in concurrent.futures.as_completed(future_to_element):
            result = future_to_element[future]


run(function={'name': _dns_query, 'dns_request_timeout': 1, 'query_type': 'MX'},
    concurrency={'threads': 15})
  • Use C or C++? XD – Rob Jul 18 '18 at 22:38
  • @Rob I have been thinking of using Rust, but I am not sure I can justify using a lower-level language because there is a leak. I don't know. – MortyAndFam Jul 19 '18 at 08:24
  • No answers?? I am surprised that nobody on Stack Overflow seems to know what might be causing this problem... – MortyAndFam Jul 19 '18 at 10:56
  • You should profile your code to see exactly what is causing the memory not to be released; try memory_profiler (a minimal sketch follows these comments). Then you will see what is causing the massive memory utilization. – Mattia Procopio Jul 20 '18 at 19:39
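
Following up on that last comment, here is a minimal profiling sketch (an addition for illustration, not from the original thread). It assumes the memory_profiler package is installed (pip install memory-profiler); its @profile decorator prints line-by-line memory usage per call, so run it on a tiny sample rather than all 30 million strings:

from memory_profiler import profile

import dns.resolver


@profile  # prints line-by-line memory usage each time the function runs
def _dns_query(target):
    resolv = dns.resolver.Resolver()
    resolv.timeout = 1
    try:
        resolv.query(target + '.com', 'MX')  # dnspython < 2.0; renamed resolve() in 2.0+
    except Exception:
        pass


for target in ['example', 'python']:  # a tiny sample, not the full list
    _dns_query(target)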

1 Answer

try this:

import json
import concurrent.futures

import dns.resolver


def sure_ok(future):
    # callback runs when the query finishes; nothing else keeps a
    # reference to the future, so it can be garbage-collected afterwards
    try:
        with open('out.txt', 'a') as f:
            f.write(str(future.result()[0]) + '\n')
    except Exception:
        pass


with concurrent.futures.ThreadPoolExecutor(max_workers=2500) as executor:
    for element in json.load(open('30_million_strings.json', 'r')):
        resolv = dns.resolver.Resolver()
        resolv.timeout = 1
        future = executor.submit(resolv.query, element + '.com', 'MX')
        future.add_done_callback(sure_ok)

Remove global bucket, as it is redundant and not needed.

Also remove the dictionary holding references to all 30+ million futures; it keeps every future (and its result) alive in memory and is likewise redundant. A bounded-submission sketch follows below.
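
Even with the callback, the loop above still submits one future per input as fast as the list can be read, so the executor's internal work queue can itself grow to millions of entries. Here is a minimal sketch (an addition for illustration, not from the original answer) that caps the number of in-flight tasks with a semaphore; the cap of 5000 is an arbitrary assumption to tune:

import json
import threading
import concurrent.futures

import dns.resolver

MAX_IN_FLIGHT = 5000  # assumption: tune for your memory budget
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def query_one(element):
    resolv = dns.resolver.Resolver()
    resolv.timeout = 1
    return resolv.query(element + '.com', 'MX')  # dnspython < 2.0 API


def done(future):
    slots.release()  # free a slot as soon as this task finishes
    try:
        with open('out.txt', 'a') as f:
            f.write(str(future.result()[0]) + '\n')
    except Exception:
        pass


with concurrent.futures.ThreadPoolExecutor(max_workers=15) as executor:
    for element in json.load(open('30_million_strings.json', 'r')):
        slots.acquire()  # blocks while MAX_IN_FLIGHT tasks are pending
        future = executor.submit(query_one, element)
        future.add_done_callback(done)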

Also, you're probably not using a new enough version of concurrent.futures (on Python 2 it is provided by the futures backport package on PyPI).
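
If you are unsure which interpreter and which copy of the module you are running, here is a quick check (an addition for illustration):

import sys
import concurrent.futures

# On Python 3 concurrent.futures ships with the standard library;
# on Python 2 it comes from the 'futures' backport package on PyPI.
print(sys.version_info)
print(concurrent.futures.__file__)  # path shows where the module was loaded from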

– jmunsch