I am parsing a JSON file to check whether each domain listed in it is live, i.e. whether its host name resolves.

The code I have is the following:

for i in range(len(json_data)):
    print(i)        
    if int(json_data[i]['response']['result_count'])>0:  
        for j in range(len(json_data[i]['response']['matches'])):
            try: 
                socket.gethostbyname(json_data[i]['response']['matches'][j]['domain'] )
            except:
                del json_data[i]['response']['matches'][j]['domain']

I have attempted to use multithreading in the following form:

def run_half():
    for i in range(0,round(len(data_json)/2)):
        print(i)        # make this len(data_json) if NOT testing, range(10) if testing
        if int(data_json[i]['response']['result_count'])>0:  
            for j in range(len(data_json[i]['response']['matches'])):
                try: 
                    socket.gethostbyname( data_json[i]['response']['matches'][j]['domain'] )
                except:
                    del data_json[i]['response']['matches'][j]['domain']
def run_half_2():
    for i in range(round((len(data_json)/2))+1,len(data_json)):
        print(i)        # make this len(data_json) if NOT testing, range(10) if testing
        if int(data_json[i]['response']['result_count'])>0:  
            for j in range(len(data_json[i]['response']['matches'])):
                try: 
                    socket.gethostbyname( data_json[i]['response']['matches'][j]['domain'] )
                except:
                    del data_json[i]['response']['matches'][j]['domain']

t1 = threading.Thread(target=run_half(),args=(10,))
t2= threading.Thread(target=run_half_2(),args=(10,))

t1.start()
t2.start()

t1.join()
t2.join()

For some reason, the total run time has not changed.

Any advice or suggestions would be greatly appreciated. Thank you!

user3666197

When assigning the target function to a thread, pass the function itself rather than calling it, i.e. change `target=run_half()` to `target=run_half`. – Abhi_J May 11 '22 at 04:11

Not related to your problem, but generally you would use a single function and have each thread run it with a chunk of the data, rather than defining a function for each chunk. – snakecharmerb May 11 '22 at 04:11

1 Answer

Yes, threading is useful here, as this is a network/IO-bound task.
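
The two-thread attempt in the question does not actually run anything in parallel, though. As the comments point out, `target=run_half()` calls the function immediately in the main thread and hands its return value (`None`) to `Thread`, so both halves run one after the other before either thread starts. The `+1` in the second range also skips one index. Below is a minimal sketch of the corrected form. `check_chunk` is a hypothetical worker name, and `time.sleep` stands in for the blocking `socket.gethostbyname` call so that the snippet runs without a network.

```python
import threading
import time


def check_chunk(chunk, results):
    """Hypothetical worker: process one slice of the data."""
    for item in chunk:
        time.sleep(0.01)      # stand-in for a blocking DNS lookup
        results.append(item)  # list.append is atomic in CPython


items = list(range(10))
mid = len(items) // 2
results = []

# Pass the function object plus its arguments; do NOT call it here.
threads = [
    threading.Thread(target=check_chunk, args=(items[:mid], results)),
    threading.Thread(target=check_chunk, args=(items[mid:], results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Slicing with `items[:mid]` and `items[mid:]` covers every index exactly once, and a single worker function replaces the two near-identical copies.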

Rather than splitting the work into groups as above, a better approach is to treat each host name check as an individual task and fan the execution out to a number of workers.

I'd suggest using the `ThreadPoolExecutor` provided by the Python standard library to achieve this.

https://docs.python.org/3/library/concurrent.futures.html

The idea is that you fan out each long-running task into a future, and then fan in to collect all the results.

e.g.,

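    import socket
    from concurrent.futures import ThreadPoolExecutor

    def long_running_task(host):
        # Placeholder for your own blocking work: True if the host name resolves.
        try:
            socket.gethostbyname(host)
            return True
        except OSError:
            return False

    list_of_work_to_do = ["url1", "url2", "url3"]

    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []

        # Fan-out work: submit() returns immediately with a Future.
        for my_url in list_of_work_to_do:
            future = executor.submit(long_running_task, my_url)
            futures.append(future)

        # Fan-in results: result() blocks until that task has finished.
        results = [future.result() for future in futures]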
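
Applied to the data layout from the question, the same pattern might look like the sketch below. It assumes each entry has `response.result_count` and a `response.matches` list of dicts with a `domain` key, as in the question's loop. `is_live` and `prune_dead_domains` are hypothetical helper names, and the `check` parameter exists only so the resolver can be swapped out when testing.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def is_live(domain):
    """Return True if the domain resolves, False otherwise."""
    try:
        socket.gethostbyname(domain)
        return True
    except (OSError, UnicodeError):  # socket.gaierror is an OSError
        return False


def prune_dead_domains(json_data, check=is_live, max_workers=8):
    """Delete the 'domain' key from every match whose domain fails `check`."""
    # Flatten the nested structure into one list of match dicts.
    matches = [
        match
        for entry in json_data
        if int(entry['response']['result_count']) > 0
        for match in entry['response']['matches']
        if 'domain' in match
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as its input.
        flags = list(executor.map(check, [m['domain'] for m in matches]))
    # Mutate only after all lookups are done, from the main thread.
    for match, live in zip(matches, flags):
        if not live:
            del match['domain']
    return json_data
```

Calling `prune_dead_domains(json_data)` would then replace the nested loops. Deleting keys only after the pool has finished keeps the worker threads read-only, which avoids modifying the data while it is being checked.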
Grantus