List Comprehension Mystery - Python

Question

I have two CSV files: an original and a DeDuped version of that file. I have read each into a list, and for all intents and purposes they are in the same format. Each list item is a string.

I am trying to use a list comprehension to find out which items were deleted by the deduplication. The length of the original is 16939 and the length of the DeDuped list is 15368. That's a difference of 1571, but the length of my list comprehension result is 368. Ideas?

deduped = open('account_de_ex.csv', 'r')
deduped_data = deduped.read()
deduped.close()
deduped = deduped_data.split("\r")

#read in file with just the account names from the full account list
account_names = open('account_names.csv', 'r')
account_data = account_names.read()
account_names.close()
account_names = account_data.split("\r")

# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]

Edit: In response to some notes in the comments, here is a test on my list comp and the lists themselves. Pretty much the same difference, 20 or so off. Not the 1500 I need! Thanks!

print len(deduped)
deduped = set(deduped)
print len(deduped)

print len(account_names)
account_names = set(account_names)
print len(account_names)


15368
15368
16939
15387

An obvious possibility to consider is that there were more than just 2 occurrences of some of the duplicate names. If the average duplicated name appeared roughly 6 times in the original list, that would account for the numbers you've quoted. — Mark Amery, Oct 29 '13 at 22:09
Well, the problem could come from your input data! What @Mark said is true, but we can't be sure if you don't provide any example of your CSV files. — Benjamin Toueg, Oct 29 '13 at 22:09
Why do you split on `"\r"` rather than the more usual `"\n"`? If you open your file in "text" mode, or if the file is generated on a Linux system, I don't think it will work right. — steveha, Oct 29 '13 at 22:16
You can check your list comprehension with sets. Try these three statements: `deduped = set(deduped)`, `account_names = set(account_names)`, `dupes = account_names.difference(deduped)` — wwii, Oct 29 '13 at 22:24
@dwerner I've tried a few DeDuping processes, including pandas and set. This is just an Excel DeDupe to make sure I was working with the same format. — MaxSavageKramer, Oct 29 '13 at 22:26
@Mark Amery I've sorted and looked through the data and the only duplicates left are variations that I want to keep. Thanks! — MaxSavageKramer, Oct 29 '13 at 22:27
Maybe you are doing something wrong when you create account_de_ex.csv. — wwii, Oct 29 '13 at 23:00
@wwii yes, but that is where it is a mystery. what might be wrong? — MaxSavageKramer, Oct 29 '13 at 23:05
Well, that's plain logic and set theory. It's just obvious that the files do not contain the data you think they do. No other possibility, unless you've introduced some bugs into the Python interpreter and recompiled it for your own purpose :) Splitting on `\r` seems suspicious too, though: it may be that some whitespace is causing trouble. Try `if ele.strip() not in deduped`. Or even filter all lists with `strip` before the calculations. — BartoszKP, Oct 29 '13 at 23:06
Your list comprehension is wrong: it doesn't give the deleted entries. As it is written, it should return an empty list. Am I wrong? — , Oct 29 '13 at 23:11
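
The point raised in the comments can be illustrated with a small sketch. The names below are made up, not taken from the asker's files, and exact string matching is assumed. The comprehension from the question keeps every occurrence of a name that is absent from `deduped`, so its length generally differs from the difference in list lengths; a `collections.Counter` subtraction is one way to get the copies that were actually removed.

```python
from collections import Counter

# Hypothetical sample data, not taken from the asker's files.
account_names = ["acme", "acme", "acme", "bolt", "bolt", "core", "dune"]
deduped = ["acme", "bolt", "core"]  # "dune" is missing entirely

# The comprehension from the question: every occurrence of a name
# that does not appear in deduped at all.
missing = [ele for ele in account_names if ele not in deduped]

# Set difference: the distinct names that do not appear in deduped.
missing_unique = set(account_names).difference(deduped)

# Multiset difference: the copies that were removed, with multiplicity.
removed = Counter(account_names) - Counter(deduped)

print(missing)
print(sorted(missing_unique))
print(sorted(removed.elements()))
print(len(account_names) - len(deduped), sum(removed.values()))
```

In this toy case the comprehension finds one item while the lists differ in length by four, which is the same kind of mismatch the question describes.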

score 2 · Accepted Answer · edited May 23 '17 at 11:57

Try running this code and see what it reports. This requires Python 2.7 or newer for collections.Counter but you could easily write your own counter code, or copy my example code from another answer: Python : List of dict, if exists increment a dict value, if not append a new dict

from collections import Counter

# read in original records
with open("account_names.csv", "rt") as f:
    rows = sorted(line.strip() for line in f)

# count how many times each row appears
counts = Counter(rows)

# get a list of tuples of (count, row) that only includes count > 1
dups = [(count, row) for row, count in counts.items() if count > 1]
dup_count = sum(count-1 for count in counts.values() if count > 1)

# sort the list from largest number of dups to least
dups.sort(reverse=True)

# print a report showing how many dups
for count, row in dups:
    print("{}\t{}".format(count, row))

# get de-duped list
unique_rows = sorted(counts)

# read in de-duped list
with open("account_de_ex.csv", "rt") as f:
    de_duped = sorted(line.strip() for line in f)

print("List lengths: rows {}, uniques {}/de_duped {}, result {}".format(
        len(rows), len(unique_rows), len(de_duped), len(de_duped) + dup_count))

# lists should match since we sorted both lists
if unique_rows == de_duped:
    print("perfect match!")
else:
    # if lists don't match, find out what is going on
    uniques_set = set(unique_rows)
    deduped_set = set(de_duped)

    # find intersection of the two sets
    x = uniques_set.intersection(deduped_set)

    # print differences
    if x != uniques_set:
        print("Rows in original that are not in deduped:\n{}".format(sorted(uniques_set - x)))
    if x != deduped_set:
        print("Rows in deduped that are not in original:\n{}".format(sorted(deduped_set - x)))

Wow, thank you so much. This did everything I needed. It showed me that everything that is in the original that is not in deduped (and vice versa) is an issue of quotations. Otherwise it picked up all the dupes! THANK YOU! <3 — MaxSavageKramer, Oct 31 '13 at 23:16
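
Since the remaining mismatches turned out to be a matter of quotation marks, one possible follow-up is to normalize each row before comparing. The sketch below is an assumption about what that could look like: the helper name `normalize` and the sample strings are hypothetical, and stripping surrounding quotes would also alter a name that legitimately begins or ends with a quote character. It also uses `str.splitlines()`, which handles `\r`, `\n`, and `\r\n` alike, in place of splitting on `"\r"`.

```python
def normalize(text):
    """Split on any line ending, then drop surrounding whitespace and quotes."""
    rows = []
    for line in text.splitlines():
        cleaned = line.strip().strip("\"'").strip()
        if cleaned:  # skip empty lines
            rows.append(cleaned)
    return rows

# Hypothetical file contents: mixed line endings and inconsistent quoting.
original_text = 'Acme Inc\r\n"Bolt, LLC"\r\nCore Co\r\n'
deduped_text = '"Acme Inc"\rBolt, LLC\rCore Co\r'

# Comparing raw pieces split on "\r" reports every row as different.
raw_diff = set(original_text.split("\r")) - set(deduped_text.split("\r"))

# Comparing normalized rows reports no differences.
clean_diff = set(normalize(original_text)) - set(normalize(deduped_text))

print(sorted(raw_diff))
print(sorted(clean_diff))
```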

tk. · Answer 2 · 2013-10-29T23:32:21.493

0

To see what you really have in each list, you can proceed by construction:

If you only had unique elements:

# list() keeps this working on Python 3, where range() is not a list
deduped = list(range(15368))
account_names2 = list(range(15387))
dupes2 = [ele for ele in account_names2 if ele not in deduped]  # len is 19

However, because you have repetitions of both removed and non-removed elements, you actually end up with:

# 15387 + 19*18 + 7 + (1571 - 368) = 16939 elements in total
account_names = account_names2 + dupes2 * 18 + dupes2[:7] + account_names2[:1571 - 368]
dupes = [ele for ele in account_names if ele not in deduped]  # dupes will have 368 elements

edited Oct 29 '13 at 23:32

answered Oct 29 '13 at 23:22


Wow, thanks. I... almost understand? How do you suggest I go about troubleshooting? – MaxSavageKramer Oct 29 '13 at 23:32
I did a set intersection and got a list of 15054 elements. Are these safe to work with without the interference you are talking about? dupes = set(deduped).intersection(account_names) – MaxSavageKramer Oct 29 '13 at 23:36
I just made an edit to make it clearer. If by DeDuped you mean strictly removing duplicates ("John" != "JOhn"), then deduped = set(account_names), which means that you have one or several accounts repeated, for a total of 1552 repetitions. If you mean something else, then you should create a filter. – tk. Oct 29 '13 at 23:41
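
The figure of 1552 repetitions follows from the lengths posted in the question: 16939 items in the list minus 15387 items in its set. A minimal sketch of that bookkeeping, assuming strict string equality and using a hypothetical helper `repetition_report` on made-up names:

```python
from collections import Counter

def repetition_report(items):
    """Return (number of distinct items, total extra copies, extra copies per item)."""
    counts = Counter(items)
    extra = {item: n - 1 for item, n in counts.items() if n > 1}
    return len(counts), sum(extra.values()), extra

# Hypothetical names; "John" and "JOhn" count as different under strict equality.
names = ["John", "JOhn", "John", "Ann", "Ann", "Ann"]
unique_count, extra_total, extra = repetition_report(names)

print(unique_count, extra_total, extra)

# The total number of extra copies always equals len(list) - len(set).
print(extra_total == len(names) - len(set(names)))

# The same arithmetic applied to the lengths posted in the question.
print(16939 - 15387)
```

The per-item dictionary shows which names account for the repetitions, which is the part a plain length comparison cannot reveal.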
