2

I really need help on how to approach a problem. I have a data set that looks like this..

Name   Sex    Total    
Anna   F      10
Jamie  M      2
Jamie  F      7
Mike   M      13
Sam    F      6
Sam    M      3

structure(list(Name = c("Anna", "Jamie", "Jamie", "Mike", "Sam", "Sam"), 
Sex = c("F", "M", "F", "M", "F", "M"), Total = c(10L, 2L, 7L, 13L, 6L, 3L)), 
.Names = c("Name", "Sex", "Total"), class = "data.frame", row.names = c(NA, -6L))

What I want to do is get the names that are both male and female names, so the results will look like..

Name   Sex  Total
Jamie  M    2
Jamie  F    7
Sam    M    3
Sam    F    6

But I'm really stumped on how to approach it.

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
CodeLearner
  • 389
  • 2
  • 6
  • 14
  • Similar, but a bit more involved task [here](http://stackoverflow.com/questions/26347343/group-androgynous-names-and-sum-amount-for-each-year-in-a-data-frame-in-r). – ilir Oct 17 '14 at 20:24

4 Answers4

5

This is how I would approach it assuming the data is stored in d:

# get a vector (set) of names that are use by both M and F
dual.names <- intersect(d$Name[d$Sex=='M'], d$Name[d$Sex=='F'])

# use set of dual names to filter data
d[d$Name %in% dual.names, ]
ilir
  • 3,236
  • 15
  • 23
4

Obligatory Hadleyverse (dplyr & tidyr) answer:

library(tidyr)
library(dplyr)

dat %>% 
  spread(Sex, Total) %>% 
  filter(!is.na(M), !is.na(F)) %>% 
  gather(Sex, Total, M, F) %>%
  arrange(Name)

##    Name Sex Total
## 1 Jamie   M     2
## 2 Jamie   F     7
## 3   Sam   M     3
## 4   Sam   F     6

EDIT and MUCH better dplyr approach via @konvas' comment:

dat %>% group_by(Name) %>% filter(length(unique(Sex)) == 2)

EDIT and, further refined by @David's comment:

dat %>% group_by(Name) %>% filter(n_distinct(Sex) == 2)

(Can I transfer points to @konvas & @David? :-)

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
  • 2
    I think you can avoid spreading and gathering like this `d %>% group_by(Name) %>% filter(length(unique(Sex)) == 2)` – konvas Oct 17 '14 at 20:18
  • @konvas this is `dplyr`, spreading and gathering is half the fun – ilir Oct 17 '14 at 20:20
  • 2
    @ilir haha fair enough :).. but if the OPs dataset is large, there is a significant speed improvement – konvas Oct 17 '14 at 20:20
  • Aye (+1 @konvas! even tho we're not supposed to do +1's in comments :-). I initially tried to do it all with just `tidyr` and completed the approach with `dplyr`. Much more compact & performant your way. And, kinda agree with @ilir that @hadley made it too easy & fun to slice & dice). – hrbrmstr Oct 17 '14 at 20:21
  • 1
    You should use dplyrs `n_distinct(Sex)` instead of `length(unique(Sex))` – David Arenburg Oct 18 '14 at 20:11
2

You can use ave to count the number of distinct genders for each name and only subset those those that have two genders. For example wiht the sample data

dd<-structure(list(Name = c("Anna", "Jamie", "Jamie", "Mike", "Sam", "Sam"), 
Sex = c("F", "M", "F", "M", "F", "M"), Total = c(10L, 2L, 7L, 13L, 6L, 3L)), 
.Names = c("Name", "Sex", "Total"), class = "data.frame", row.names = c(NA, -6L))

You can do

both<-with(dd, ave(Sex, Name, FUN=function(x) length(unique(x))))=="2"
dd[both, ]

to get

   Name Sex Total
2 Jamie   M     2
3 Jamie   F     7
5   Sam   F     6
6   Sam   M     3

as desired.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
2

Joining a bit late, but here's a data.table approach

library(data.table)
setDT(df)[ , .SD[length(unique(Sex)) == 2], by = Name]
##     Name Sex Total
## 1: Jamie   M     2
## 2: Jamie   F     7
## 3:   Sam   F     6
## 4:   Sam   M     3

Or, if you don't have duplicates, here's a faster solution

setDT(df)[ , .SD[.N == 2], by = Name]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196