My data looks like this:
ID Email
1 someone@hotmail.com
2 someone1@hotmail.com
3 someone2@hotmail.com
4 someone3@hotmail.com
5 someone4@hotmail.com
6 someone5@hotmail.com
Each email should belong to exactly 1 ID, but that's not the case: there are fewer unique emails than rows.
> dim(data)
[1] 5071 2
> length(unique(data$ID))
[1] 5071
> length(data$Email)
[1] 5071
> length(unique(data$Email))
[1] 4481
So, I need to find the IDs with duplicated email addresses.
Seems like this should be easy, but I'm striking out:
> sqldf("select ID, count(Email) from data group by ID having count(Email) > 1")
[1] ID count(Email)
<0 rows> (or 0-length row.names)
I've also tried taking off the having clause, sending the result to an object, and sorting the object by count(Email)... it appears that every ID has a count(Email) of 1.
I would dput the actual data, but I can't due to the sensitivity of email addresses.
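
Since I can't share the real data, here is a minimal sketch on a made-up stand-in (the addresses and the six-row size are hypothetical, not my actual data). It assumes the columns are named ID and Email as in the sample above. Because every ID is unique, grouping by ID can only ever give a count of 1, so this sketch flags rows by Email instead, using base R's duplicated():

```r
# Made-up stand-in for the real data; these addresses are hypothetical.
data <- data.frame(
  ID = 1:6,
  Email = c("a@example.com", "b@example.com", "a@example.com",
            "c@example.com", "b@example.com", "d@example.com"),
  stringsAsFactors = FALSE
)

# duplicated() marks only the second and later occurrences, so scan from
# both ends to flag every row whose address appears more than once.
is_dup <- duplicated(data$Email) | duplicated(data$Email, fromLast = TRUE)

# Rows (and therefore IDs) that share an address with another ID,
# ordered so that rows with the same address sit next to each other.
dup_rows <- data[is_dup, ]
dup_rows <- dup_rows[order(dup_rows$Email, dup_rows$ID), ]
print(dup_rows)
```

The same idea in sqldf would presumably be to group by Email rather than ID, e.g. "select Email, count(ID) from data group by Email having count(ID) > 1", and then look up the IDs for those addresses.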