I have a list of 'ids':

ids = [None, '20160928a', '20160929a', ... ]

and another list of certain 'ids' that I found were duplicate ids using fuzzywuzzy:

repeat_offenders = ['20160928a', '20161115a', '20161121a', ... ]

I would like to use fuzzywuzzy again to build a list of lists, where each inner list holds the indices at which one group of duplicate ids appears in 'ids'. Because they are duplicates, each inner list would contain at least two elements. The output would look something like this:

collected_ids = [[0,5,700], [6,3], [4,826,12]]

My attempt, which currently returns only the matching ids, not their locations:

from fuzzywuzzy import process

collected_urls = []
for offender in repeat_offenders[:10]:
    # process.extract returns (match, score) pairs, so the index into ids is lost
    best_match = process.extract(offender, ids)
    collection = []
    for match in best_match:
        if match[1] > 95:
            collection.append(match[0])
    collected_urls.append(collection)

Update, my attempt at using Moe's answer to find/group exact matches:

idz = ids
collected_ids = []
for i in range(len(idz)):
    tmp = [i]
    for j in range(len(ids)):
        if idz[i] == idz[j] and i != j:
            tmp.append(j)
            del j 
    if len(tmp) > 1:
        collected_ids.append(tmp)
    del i
Graham Streich
  • Any reason you're using `fuzzywuzzy` to find duplicates? If the strings match, can't you simply test for equality? – gold_cy Jun 23 '17 at 19:34
  • Because I want to group similar ids: the 'duplicates' may not be exactly (100%) the same, which is why I used the 95 threshold. However, quite a few of the duplicates are exactly the same, so using equality may be the easiest (and a good enough) option. Would you mind showing me how to do what I am asking for based on equality? Thanks! – Graham Streich Jun 23 '17 at 19:42

1 Answer

If using fuzzywuzzy is not a must, you can use two nested for-loops to check for exact duplicates and build the list as follows:

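collected_ids = []
for i in xrange(len(ids)):  # Python 2; use range() in Python 3
    tmp = [i]
    for j in xrange(len(ids)):
        if ids[i] == ids[j] and i != j:
            tmp.append(j)
    if len(tmp) > 1:
        collected_ids.append(tmp)
# Lists are unhashable, so set(collected_ids) raises TypeError;
# deduplicate via sorted tuples instead.
collected_ids = [list(group) for group in set(tuple(sorted(tmp)) for tmp in collected_ids)]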

EDIT:

If you want to avoid duplicate groups, you can keep a list of the indices that have already been added and check against it, as follows:

collected_ids = []
ids = ['a', 'b', 'a', 'c', 'd', 'a', 't', 't', 'k', 'c']
check = [] 
for i in range(len(ids)):
    tmp = [i]
    check.append(i)  
    for j in range(len(ids)):
        if ids[i] == ids[j] and i != j and j not in check:
            tmp.append(j)
            check.append(j)
    if len(tmp) > 1:
        collected_ids.append(tmp)
print(collected_ids)

output:

[[0, 2, 5], [3, 9], [6, 7]]
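
The nested loops above compare every pair of ids, which can get slow for long lists. As a possible alternative for exact matches, here is a minimal sketch that groups indices in a single pass with a dictionary. The function name `group_duplicate_indices` is made up for illustration, and skipping `None` entries is an assumption based on the `None` shown in the question's list.

```python
from collections import defaultdict


def group_duplicate_indices(ids):
    """Return lists of indices whose values are equal, in first-seen order."""
    positions = defaultdict(list)  # value -> every index where it occurs
    for index, value in enumerate(ids):
        # Assumption: None marks a missing id and should not count as a duplicate.
        if value is not None:
            positions[value].append(index)
    # Keep only values that occur at least twice.
    return [indices for indices in positions.values() if len(indices) > 1]


ids = ['a', 'b', 'a', 'c', 'd', 'a', 't', 't', 'k', 'c']
print(group_duplicate_indices(ids))  # [[0, 2, 5], [3, 9], [6, 7]]
```

This relies on dictionaries preserving insertion order (Python 3.7+), so the groups come out in the same order as in the loop version.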
Mohd
  • Thanks, but this isn't working: xrange() is not defined in Python 3 (which is what I am using). Also, once 'i' has found its matches, it and its matches should be deleted from the list 'ids'; otherwise the loop revisits them and collected_ids ends up containing duplicated lists. I tried solving these problems and have updated my question with my attempt :) – Graham Streich Jun 24 '17 at 12:40
  • @GrahamStreich I have updated the answer to avoid duplicates, please check it out and let me know =) – Mohd Jun 24 '17 at 12:59
  • Glad to know :) and please don't forget to accept the answer if it solves your question! – Mohd Jun 24 '17 at 13:13
  • Thanks, I would accept the answer, except it only provides a heuristic for the actual problem of using fuzzywuzzy ;p I do appreciate your time and answer though :) – Graham Streich Jun 24 '17 at 16:46
  • @GrahamStreich If you are using python3 then use `range` instead of `xrange` – Jaffer Wilson Jul 26 '17 at 07:37
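
On the fuzzy part of the question, which the thread leaves open: one way to keep positions is to score every id against each repeat offender and record the index rather than the matched string. The sketch below is not fuzzywuzzy itself; it swaps in the standard library's `difflib.SequenceMatcher` so that it runs without extra packages. The helper names `similarity` and `group_similar_indices` are hypothetical, skipping `None` is an assumption, and the sample data is invented. Replacing `similarity` with fuzzywuzzy's `fuzz.ratio` should give comparable 0 to 100 scores, though that substitution is untested here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (a stdlib stand-in for a fuzzy ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def group_similar_indices(ids, repeat_offenders, threshold=95):
    """For each offender, collect the indices in ids that score above threshold."""
    collected = []
    seen = set()  # indices already assigned to an earlier group
    for offender in repeat_offenders:
        group = [
            index
            for index, candidate in enumerate(ids)
            # Assumption: None entries are missing ids and are skipped.
            if candidate is not None
            and index not in seen
            and similarity(offender, candidate) > threshold
        ]
        if len(group) > 1:  # a duplicate group needs at least two positions
            collected.append(group)
            seen.update(group)
    return collected


# Invented sample data in the shape shown in the question.
sample_ids = [None, '20160928a', '20160929a', '20160928a', '20161115a', '20161115a']
sample_offenders = ['20160928a', '20161115a']
print(group_similar_indices(sample_ids, sample_offenders))  # [[1, 3], [4, 5]]
```

Every offender is compared against every id, so this is still quadratic in the worst case; running the exact-match grouping first and applying the fuzzy pass only to what remains may reduce the work.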