
I have a large list, `jsn`, that contains many strings, some of which may be duplicates. I need to compare every element with the others for similarity and collect the indexes of the duplicates in a `dubs` list, so that I can then remove those items from `jsn`.

Because `jsn` is large, I decided to use threading to speed up the inner `for` loop and cut the waiting time.

But Thread/Process is not working as I expected.

The code below, which uses `Thread`, makes no difference to performance, and the `dubs` list is still empty after all the threads have been joined.

I also tried it without `.join()`, with no success: `dubs` was still empty and performance did not change.

The main problem: the `dubs` list is empty by the time I start deleting duplicates.

from threading import Thread
from multiprocessing import Process
from difflib import SequenceMatcher

# Searching for duplicates in array
def finddubs(jsn,dubs,a):
    for b in range(len(jsn)):
        if ((jsn[a] == jsn[b]) or (SequenceMatcher(None, jsn[a], jsn[b]).ratio() > 40)):
            dubs.append(b) # add duplicate list element keys to duplicates array

# Start threading
threads = []
for a in range(len(jsn)):
    t = Thread(target=finddubs, args=(jsn,dubs,a))
    threads.append(t)
    t.start()
for thr in threads:
    thr.join()

# Delete duplicate list items 
for d in dubs:
    k = int(d)
    del jsn[k]

Without threading, the code works.

redevil
  • You should be getting *lots* of errors when you run this - you're passing three parameters when you launch `finddubs()` in each thread, but the function takes no parameters. – jasonharper Sep 15 '22 at 14:40
  • @jasonharper I edited the code; `finddubs()` now takes arguments, but it is still not working – redevil Sep 15 '22 at 14:41
  • Python threading for CPU-bound operations doesn’t help due to the GIL serializing Python code. Multiprocessing can help, depending on the size of the job since process startup is expensive. – Mark Tolonen Sep 15 '22 at 14:58
  • Could you add some more code please? – redevil Sep 15 '22 at 15:23
  • You pass `dubs` to the threads in `t = Thread(target=finddubs, args=(jsn,dubs,a))`, but that variable is never defined, so this should raise an error. Did you run the code in a console/terminal to see the errors? If you get an error, show it in the question (not in comments) as text (not as an image) – furas Sep 15 '22 at 20:44
  • Maybe first use `print()` (and `print(type(...))`, `print(len(...))`, etc.) to see which part of the code is executed and what you really have in your variables. This is called "print debugging", and it helps you see what the code is really doing. – furas Sep 15 '22 at 20:46
  • Deleting elements by index can be a bad idea: when you delete an element, the elements after it change position and get different indexes, so you may delete the wrong elements (or try to delete at an index that no longer exists) – furas Sep 15 '22 at 20:49
  • Your code compares `jsn[0]` with `jsn[0]` and removes it, and does the same for every other element. You should check `a != b` – furas Sep 15 '22 at 20:53
  • If you use `set(jsn)`, it will remove the exact duplicates, and you will not need the `jsn[a] == jsn[b]` check – furas Sep 15 '22 at 20:56
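
The points raised in the comments above (skip self-comparison, drop exact duplicates first, avoid deleting by index) can be combined by building a new list instead of mutating `jsn`. This is a minimal single-threaded sketch, not the asker's code: `drop_near_duplicates` is a hypothetical helper, the `0.9` threshold is an assumed value on the 0 to 1 scale that `ratio()` returns, and each string is compared only against the items already kept (a greedy simplification).

```python
from difflib import SequenceMatcher


def drop_near_duplicates(items, threshold=0.9):
    """Return a new list that keeps the first string of each group of similar strings."""
    # dict.fromkeys removes exact duplicates while preserving the original order
    unique = list(dict.fromkeys(items))
    kept = []
    for candidate in unique:
        # Compare only against strings that were already kept (assumed greedy rule)
        is_similar = any(
            SequenceMatcher(None, candidate, other).ratio() > threshold
            for other in kept
        )
        if not is_similar:
            kept.append(candidate)
    return kept


sample = ["apple pie", "banana", "apple pie", "apple pies", "cherry"]
print(drop_near_duplicates(sample))  # ['apple pie', 'banana', 'cherry']
```

Because nothing is deleted in place, no index can go stale, and an element is never compared with itself.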

1 Answer


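You need to use multiprocessing instead of threading if you want to speed up your computations. In CPython, the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so threads do not speed up CPU-bound work like this. Please read about the GIL for details.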

An example of how multiprocessing can be used for this task:

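import multiprocessing
from difflib import SequenceMatcher
from uuid import uuid4

# Let's generate a large list with random data
# where we have few duplicates: "abc" indices: 0, 1_001 ; "b" - indices 1_002, 1_003
# (with real data, load jsn the same way in every process: under the "spawn"
# start method each worker re-imports this module and regenerates the random strings)
jsn = ['abc'] + [str(uuid4()) for _ in range(1_000)] + ['abc', 'b', 'b']

# ratio() returns a float in [0, 1], so the original `> 40` could never be true.
# 0.9 is an assumed cutoff; tune it for your data.
THRESHOLD = 0.9


def compare_strings(a: int, b: int):
    if jsn[a] == jsn[b] or SequenceMatcher(None, jsn[a], jsn[b]).ratio() > THRESHOLD:
        return a, b
    return None


# The guard is required where workers are started by re-importing this module (Windows, macOS)
if __name__ == "__main__":
    # now we are comparing all possible pairs using multiprocessing
    pairs = [(i, j) for i in range(len(jsn)) for j in range(i + 1, len(jsn))]
    with multiprocessing.Pool(processes=10) as pool:
        results = pool.starmap(compare_strings, pairs)

    for result in results:
        if result is not None:
            a, b = result
            print(f"Duplicated pair: {a} {b} {jsn[b]}")
            # delete duplicates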
A modification of your code that should work:

from difflib import SequenceMatcher
from threading import Thread
from uuid import uuid4

# Let's generate a list with random data
# where we have few duplicates: "abc" indices: 0, 101 ; "b" - indices 102, 103
jsn = ["abc"] + [str(uuid4()) for _ in range(100)] + ["abc", "b", "b"]
dubs = []

# ratio() returns a float in [0, 1], so the original `> 40` could never be true.
# 0.9 is an assumed cutoff; tune it for your data.
THRESHOLD = 0.9

# Searching for duplicates in array
def finddubs(jsn, dubs, a):
    for b in range(a + 1, len(jsn)):
        if jsn[a] == jsn[b] or SequenceMatcher(None, jsn[a], jsn[b]).ratio() > THRESHOLD:
            print(a, b)
            dubs.append(b)  # add duplicate list element keys to duplicates array


# Start threading
threads = []
for a in range(len(jsn)):
    t = Thread(target=finddubs, args=(jsn, dubs, a))
    threads.append(t)
    t.start()
for thr in threads:
    thr.join()

# Delete duplicate list items
print(dubs)
# Delete from the highest index down so earlier deletions do not shift later ones;
# set() drops indexes that were reported more than once
for k in sorted(set(dubs), reverse=True):
    del jsn[k]
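
Deleting by index in ascending order is fragile: after each `del`, every later element moves down one position, so the remaining indexes in `dubs` point at the wrong items or past the end of the list. A minimal sketch of one way around this, assuming `dubs` may contain repeated indexes; `delete_indexes` is a hypothetical helper, not part of the original code:

```python
def delete_indexes(items, indexes):
    """Delete the given positions from items in place and return items."""
    # set() ignores repeated indexes; descending order means each deletion
    # only shifts elements that have already been handled
    for k in sorted(set(indexes), reverse=True):
        del items[k]
    return items


letters = ["a", "b", "c", "d"]
print(delete_indexes(letters, [1, 3, 3]))  # ['a', 'c']
```
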
u1234x1234
  • Thanks, your modified code is working!! Could you please explain the difference between `for b in range(len(jsn)):` and `for b in range(a + 1, len(jsn)):`, and why my `for` loop is not working? – redevil Sep 16 '22 at 06:46
  • You never defined the `dubs` variable, so your code does not work. I used `range(a + 1, n)` instead of `range(n)` because I'm guessing you are trying to find duplicates by comparing the elements with each other. Your comparison is symmetric, so you can skip some of the pair comparisons: x == y is the same as y == x. I iterate only over the unique pairs of indexes. – u1234x1234 Sep 16 '22 at 18:21
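
To make the difference between the two loops concrete, here is a small sketch that lists the index pairs each one visits. The helper names are illustrative only. Note that `==` is exactly symmetric, while the `difflib` documentation says the result of `ratio()` may depend on the order of the arguments, so for the similarity check "symmetric" is an approximation.

```python
from itertools import combinations


def all_ordered_pairs(n):
    """Pairs visited by `for a in range(n)` with `for b in range(n)`."""
    return [(a, b) for a in range(n) for b in range(n)]


def unique_pairs(n):
    """Pairs visited by `for a in range(n)` with `for b in range(a + 1, n)`."""
    return list(combinations(range(n), 2))


n = 4
print(len(all_ordered_pairs(n)))  # 16, including (0, 0), (1, 1), ... and both (0, 1) and (1, 0)
print(unique_pairs(n))  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

With `range(len(jsn))`, every element is also compared with itself, so every index ends up in `dubs`; with `range(a + 1, len(jsn))`, each pair is checked once and only the later index of a matching pair is recorded.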