0

I have a data frame with 4 columns titled 'year' 'name' 'sex' 'amount'. Here is a sample data set

set.seed(1)
    data = data.frame(year=sample(1950:2000, 50, replace=TRUE),name=sample(LETTERS, 50, replace=TRUE),
                       sex=sample(c("M", "F"), 50, replace=TRUE), amount=sample(40:100, 50, replace=TRUE))

I want to find only names that occur as both an ‘m’ and an ‘f’ and sum the amount for each year.

Any help would be greatly appreciated

3442
  • 8,248
  • 2
  • 19
  • 41
beck8
  • 35
  • 4
  • Like `aggregate(amount ~ name + sex, data[data$year == 1950,], sum)`? In this case it looks like there were no females for 1950. Oh, well actually that can differ since you used `sample` Please use `set.seed` when using `sample` so that we can use the exact same data you are using – Rich Scriven Oct 13 '14 at 19:24
  • Sorry I was a bit unclear in my question – I want to return the sum of the androgynous names for each year, so that at a glance I can see the amount born with an androgynous name in a certain year. Your code returns: name sex amount R F 69 As R was the only name to appear in 1950 I am unsure if the code has worked or not. I tried aggregate(amount ~ name + sex, data[data$year,], sum) on my own data frame and it returns 3 columns ‘name’ ‘sex’ ‘amount’ but the ‘year’ column is missing and all the values in the ‘sex’ column are ‘f’? – beck8 Oct 13 '14 at 19:41
  • We will continue to get mis-matched results until you add `set.seed(1)` to the top of your code in the post, and then create the `data` set again. This is the consequence of using `sample`, like I said. – Rich Scriven Oct 13 '14 at 19:42

1 Answers1

1

I changed the data a bit, so that there would be common names in some years. On visual inspection this seems to work. The syntax may appear a bit cryptic if you are not familiar with data.table but it's concise and should be quite efficient:

require(data.table)
set.seed(1)
data = data.frame(year=sample(1950:1960, 100, replace=TRUE),name=sample(LETTERS, 100, replace=TRUE),
                  sex=sample(c("M", "F"), 100, replace=TRUE), amount=sample(40:100, 100,  replace=TRUE))
setDT(data) ## change to data.table format
data[, common=sum(amount[name %in% intersect(name[sex=="M"], name[sex=="F"])]), by=year]
ilir
  • 3,236
  • 15
  • 23
  • Thank you for your answer :) The code on the sample data returns two columns 'year' and 'v1' I presume v1 is the sum of the names that occur as both 'm' and 'f'? When I run it on my own data frame which is quite large it returns the first and the last 5 years. This is fine - will all the years appear though if I decide to plot the information as a graph ? – beck8 Oct 13 '14 at 20:35
  • That is just the print method for `data.table`. The data is all there and you can use it as a normal `data.frame`, or you can convert it with `as.data.frame()`. – ilir Oct 13 '14 at 20:37
  • Just to be clear, this only adds occurrences of names that were common to both girls and boys *in that given year*. If a name is used for girls in one year, and boys in another, then it is not counted. – ilir Oct 13 '14 at 20:45