Trouble Finding ID's with Duplicate Fields

Question

My data looks like this:

ID  Email
1   someone@hotmail.com
2   someone1@hotmail.com
3   someone2@hotmail.com
4   someone3@hotmail.com
5   someone4@hotmail.com
6   someone5@hotmail.com

There should be exactly 1 email per ID, but there's not.

> dim(data)
[1] 5071    2
> length(unique(data$Person_Onyx_Id))
[1] 5071
> length((data$Email))
[1] 5071
> length(unique(data$Email))
[1] 4481

So, I need to find the ID's with duplicated email addresses.

Seems like this should be easy, but I'm striking out:

> sqldf("select ID, count(Email) from data  group by ID having count(Email) > 1")
[1] ID count(Email)  
<0 rows> (or 0-length row.names)

I've also tried taking off the having clause and sending the result to an object and sorting the object by the count(Email)... it appears that every ID has count(Email) of 1...

I would dput the actual data but I can't due to the sensitivity of email addresses.

Try `Group By Emails` then look for IDs where Emails counts more than 1. If you Group By ID you will always get 1 as Count(Email) — Blood-HaZaRd, Feb 05 '15 at 15:43
@shadow `duplicated` gave `logical(0)` Idk why, because there are definitely duplicates... — Hack-R, Feb 05 '15 at 17:54

score 2 · Answer 1 · answered Feb 05 '15 at 15:48

2

My guess is that you have NULL emails. You could find this by using count(*) rather than count(email):

select ID, count(*)
from data
group by ID
having count(*) > 1;

answered Feb 05 '15 at 15:48

Gordon Linoff

1,242,037
58
646
786

Thanks. Yes, I was also suspicious of this but that gives me `<0 rows> (or 0-length row.names)` as well... – Hack-R Feb 05 '15 at 17:54
OK I have had a look at the duplicates via View(datat$Email[order(data$Email),]) so they were duplicates afterall – Hack-R Feb 05 '15 at 17:58

score 2 · Accepted Answer · answered Feb 05 '15 at 15:57

2

Are you also sure you don't have the opposite condition, multiple ids with the same email?

select Email, count(*)
from data
group by Email
having count(*) > 1;

answered Feb 05 '15 at 15:57

HLGEM

94,695
15
113
186

Trouble Finding ID's with Duplicate Fields

2 Answers2