Listing duplicate strings of a list in another list

Question

I have the following issue: I have been asked to write a Python script that lists every pair of duplicate names.

The problem is that only part of each string is the same; the last part is a number (the deployment time). For example:
asg-lc-crl-tst-turfpari-rtl20220124153420214800000001
asg-lc-crl-tst-turfpari-rtl20220330150836189100000001

Let's say I have a list with these 9 values:

(0) -- asg-lc-crl-tst-turfpari-rtl20220124153420214800000001  <--- duplicate with (1)
(1) -- asg-lc-crl-tst-turfpari-rtl20220330150836189100000001  <--- duplicate with (0)
(2) -- asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001  <--- duplicate with (4)
(3) -- asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001
(4) -- asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001  <--- duplicate with (2)
(5) -- asg-lc-crl-tst-art-manager-rtl20220124162240173500000001  <--- duplicate with (6)
(6) -- asg-lc-crl-tst-art-manager-rtl20220330150933020900000001  <--- duplicate with (5)
(7) -- asg-lc-bck-ope-backoh-oh20201021134525920100000001
(8) -- asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002

I have written this code, but it is not working properly:

def list_duplicate_asg(asg1, asg2):
    # First compare everything before the last hyphen
    if asg1.rpartition('-')[0] == asg2.rpartition('-')[0]:
        suffix1 = asg1.rpartition('-')[2]
        suffix2 = asg2.rpartition('-')[2]

        # Then compare the first 3 characters of the last segment (e.g. "rtl")
        if suffix1[0:3] == suffix2[0:3]:
            print('\n ========== Duplicate exists =========: \n')
            print(asg1 + ',' + asg2 + '\n ============================ \n')

As you can see, if the values follow each other in the list, they get printed:

0 & 1 : they get printed
5 & 6 : they get printed
But, for example, (2) & (4) don't get printed...

I don't know whether my parsing method is efficient, or whether there is a much better one.
And how can I improve my code so that it detects duplicates even when they are not next to each other?

I want the result to look like this:

Duplicates : asg-lc-crl-tst-turfpari-rtl20220124153420214800000001,asg-lc-crl-tst-turfpari-rtl20220330150836189100000001
Duplicates : asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001,asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001
Duplicates : asg-lc-crl-tst-art-manager-rtl20220124162240173500000001,asg-lc-crl-tst-art-manager-rtl20220330150933020900000001

If you can guarantee that the trailing part that doesn't need to be compared always has the same number of characters (23 if I counted correctly), you could just cut it off and compare only the first part. — cSharp, Apr 21 '22 at 06:32
I just listed the simplest names; some are much longer, so taking a fixed number of characters doesn't really work. For example: asg-lc-crl-dev-annulation-centrale-rtl20220207153634923900000001 (32 characters before the timestamp suffix), asg-lc-crl-dev-operator-manager-rtl20220414134402035700000001 (35 characters before the timestamp suffix), asg-lc-crl-in2-turfpari-rtl20220420135427744400000001 (26 characters before the timestamp suffix). That's why I'm struggling right now. — LightGFX, Apr 21 '22 at 06:55

Florian Metzger-Noel · Answer 1 · 2022-04-21T08:30:02.097

This should work. Each string is first stripped of its timestamp suffix; the cleaned string serves as a key. One dict (asgs_found) keeps track of every key seen so far, together with the first name that produced it. When a key turns up again, the name goes into a second dict (duplicates), which maps each key to the list of all names that share it.

import re

asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
     'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
     'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
     'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
     'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
     'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
     'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
     'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
     'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']


def get_duplicate_asgs(asgs: list):
    asgs_found = {}
    duplicates = {}
    for asg in asgs:
        asg_cleaned = asg[0:-26]
        # alternative solution for time stamps of different length:
        # asg_cleaned = re.sub("[0-9]+$", "", asg)
        if asg_cleaned in asgs_found:
            if asg_cleaned in duplicates:
                duplicates[asg_cleaned].append(asg)
            else:
                duplicates[asg_cleaned] = [asgs_found[asg_cleaned], asg, ]
        else:
            asgs_found[asg_cleaned] = asg
    # convert the dict view to a plain list so it prints cleanly
    return list(duplicates.values())

print(get_duplicate_asgs(asgs))
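
To get the exact output format requested in the question, the same grouping idea can be written a little more compactly. This is only a sketch, not the code from the answer: `group_duplicates` is a hypothetical helper name, it uses the regex variant so that it does not depend on a fixed timestamp length, and it assumes Python 3.7+ so that groups come out in first-seen order.

```python
import re
from collections import defaultdict

asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
        'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
        'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
        'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
        'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
        'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
        'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
        'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
        'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']


def group_duplicates(names):
    """Return lists of names that share the same prefix (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: the timestamp is the run of digits at the very end
        key = re.sub(r"[0-9]+$", "", name)
        groups[key].append(name)
    # Keep only the prefixes that occurred more than once
    return [group for group in groups.values() if len(group) > 1]


for group in group_duplicates(asgs):
    print("Duplicates : " + ",".join(group))
```

For the nine names above this prints one line per group of duplicates, in the order the question asks for. Because every name is appended to its group, the position of the duplicates in the list does not matter.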

I just listed the simplest names; some are much longer, so taking a fixed number of characters doesn't really work. For example: asg-lc-crl-dev-annulation-centrale-rtl20220207153634923900000001 (32 characters before the timestamp suffix), asg-lc-crl-dev-operator-manager-rtl20220414134402035700000001 (35 characters before the timestamp suffix), asg-lc-crl-in2-turfpari-rtl20220420135427744400000001 (26 characters before the timestamp suffix). That's why I'm struggling right now. — LightGFX, Apr 21 '22 at 06:56
I'm taking 26 characters off the end. As long as the timestamp always has the same length (which it does in your examples), this should work. If your timestamps can also have different lengths (which would be unusual), use the alternative solution that I added to the script. — Florian Metzger-Noel, Apr 21 '22 at 08:29
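
A quick way to check this on the longer names from the comment above: slicing from the end does not depend on how long the part before the timestamp is. The snippet is a sketch; `TIMESTAMP_LEN` and `splits` are names introduced here, and the value 26 is taken from the answer.

```python
# Names quoted in the comment thread; their prefixes have different lengths
names = ['asg-lc-crl-dev-annulation-centrale-rtl20220207153634923900000001',
         'asg-lc-crl-dev-operator-manager-rtl20220414134402035700000001',
         'asg-lc-crl-in2-turfpari-rtl20220420135427744400000001']

TIMESTAMP_LEN = 26  # assumption: every timestamp suffix is exactly 26 digits

# Split each name into (prefix, timestamp) by counting from the end
splits = [(name[:-TIMESTAMP_LEN], name[-TIMESTAMP_LEN:]) for name in names]

for prefix, timestamp in splits:
    # timestamp.isdigit() confirms that only digits were cut off
    print(prefix, timestamp.isdigit())
```

If `isdigit()` ever reports False, the fixed length does not hold for that name and the regex variant is the safer choice.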

DarkKnight · Answer 2 · 2022-04-21T08:24:03.630

The data shown in the question appears to be made up of at least 3 whitespace-delimited tokens per line, and the 3rd token is the one of interest. The final segment, which carries the timestamp, begins after the last hyphen. Therefore:

asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
        'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
        'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
        'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
        'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
        'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
        'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
        'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
        'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']

counter = dict()

for asg in asgs:
    k = asg[:asg.rfind('-')]
    counter[k] = counter.setdefault(k, 0) + 1
    if counter[k] == 2:
        print(k)

Obviously this will need to be adapted according to how the data are actually made available to your program. The output from this will be:

asg-lc-crl-tst-turfpari
asg-lc-dpr-dev1-app_hode
asg-lc-crl-tst-art-manager
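
Note that cutting at the last hyphen also drops the short tag in front of the timestamp ("rtl", "hdh", "oh"), which the code in the question compares separately. The sketch below contrasts the two keys; the function names are introduced here for illustration, and the second name is hypothetical (it is not in the question's data).

```python
import re


def key_last_hyphen(name):
    # Key used in this answer: everything before the last hyphen
    return name[:name.rfind('-')]


def key_strip_digits(name):
    # Alternative key: drop only the trailing digits, keep the tag
    return re.sub(r"[0-9]+$", "", name)


a = 'asg-lc-crl-tst-turfpari-rtl20220124153420214800000001'  # from the question
b = 'asg-lc-crl-tst-turfpari-hdh20220330150836189100000001'  # hypothetical: same name, different tag

print(key_last_hyphen(a) == key_last_hyphen(b))    # True: treated as duplicates
print(key_strip_digits(a) == key_strip_digits(b))  # False: the tag differs
```

Which key is right depends on whether names that differ only in the tag should count as duplicates; for the nine names in the question both keys find the same three groups.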

My list is in this form: asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001', 'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',...]. Your code doesn't work. — LightGFX, Apr 21 '22 at 08:11
