
For the following simple dataset:

   row  country year
     1  NLD     2005
     2  NLD     2005       
     3  BLG     2006
     4  BLG     2005
     5  GER     2005
     6  NLD     2007
     7  NLD     2005
     8  NLD     2008

the following code:

df[, .N, by = list(country, year)][,prop := N/sum(N)]

gives each country-year group's share of all observations. What I want, however, is each group's share of the observations for its own country. How should I adapt this code to get those proportions?

Desired output:

   row  country year  prop
     1  NLD     2005   0.6
     2  NLD     2005   0.6    
     3  BLG     2006   0.5
     4  BLG     2005   0.5
     5  GER     2005   1
     6  NLD     2007   0.2
     7  NLD     2005   0.6  
     8  NLD     2008   0.2
Tom

1 Answer


Using data.table:

library(data.table)

df <- read.table(header = TRUE, text = "row  country year
     1  NLD     2005
     2  NLD     2005
     3  BLG     2006
     4  BLG     2005
     5  GER     2005
     6  NLD     2007
     7  NLD     2005
     8  NLD     2008")

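setDT(df)[, sum := .N, by = country               # rows per country (helper column)
        ][, prop := .N, by = c("country", "year") # rows per country-year
        ][, prop := prop / sum                    # share within the country
        ][, sum := NULL]                          # drop the helper column

df[]  # print the result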


    row country year prop
1:   1     NLD 2005  0.6
2:   2     NLD 2005  0.6
3:   3     BLG 2006  0.5
4:   4     BLG 2005  0.5
5:   5     GER 2005  1.0
6:   6     NLD 2007  0.2
7:   7     NLD 2005  0.6
8:   8     NLD 2008  0.2
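
As an alternative sketch that stays closer to the code in the question: adding `by = country` to the second step makes `sum(N)` a per-country total instead of a grand total. This yields one row per country-year, so the optional last step (an update join, which is my addition and not something the question asked for explicitly) copies the share back onto every original row. The name `shares` is only illustrative.

```r
library(data.table)

df <- data.table(
  row     = 1:8,
  country = c("NLD", "NLD", "BLG", "BLG", "GER", "NLD", "NLD", "NLD"),
  year    = c(2005, 2005, 2006, 2005, 2005, 2007, 2005, 2008)
)

# One row per country-year; sum(N) is now taken within each country.
shares <- df[, .N, by = .(country, year)][, prop := N / sum(N), by = country]

# Optional: attach the share to every original row via an update join.
df[shares, on = .(country, year), prop := i.prop]

df[]
```

On the sample data this should reproduce the `prop` column shown above.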
sm925
  • Thank you very much! I am getting the following error in my actual dataset: `Error in [.data.table(setDT(ES2)[, :=(sum, .N), by = m1a], , :=(prop, : Type of RHS (integer) must match LHS (double). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1).` Could that be the result of one of my variables being a factor? – Tom Sep 24 '18 at 13:56
  • Make sure you don't have sum or prop columns in your data.table before running my solution. Then try it once again, I guess it'll work. – sm925 Sep 24 '18 at 14:01
  • I cannot seem to get it to work. As for your comment: there should be no sum column in there. By prop columns, do you mean something like a float/double? I'm working with very big datasets, so having to check, change or subset them would not really be viable. – Tom Sep 24 '18 at 14:12
  • Please `dput` a subset of your original data set so I can figure out what's going wrong. It's hard to tell like this. The solution works on the data you provided. – sm925 Sep 24 '18 at 14:21
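
The thread does not confirm the cause of the error, so the following is only one plausible reading. `setDT()` and `:=` modify the table by reference, so if a double `prop` column already exists (for instance, left over from an earlier run of the same line), the grouped `prop := .N` tries to write integers into a double column, which data.table versions from that period reject with this message. A minimal sketch that sidesteps the issue by computing the ratio in a single grouped step, so the right-hand side is always double; the helper name `n_country` and the simulated leftover column are assumptions for illustration:

```r
library(data.table)

df <- data.table(
  row     = 1:8,
  country = c("NLD", "NLD", "BLG", "BLG", "GER", "NLD", "NLD", "NLD"),
  year    = c(2005, 2005, 2006, 2005, 2005, 2007, 2005, 2008)
)

# Assumption: simulate a double `prop` column left over from a previous run.
df[, prop := 0]

# Count rows per country, then divide the country-year count by it.
# `.N / n_country` is integer / integer, which R evaluates as double,
# so it matches an existing double `prop` column.
df[, n_country := .N, by = country
 ][, prop := .N / n_country, by = .(country, year)
 ][, n_country := NULL]

df[]
```

Avoiding `sum` as a column name also keeps the helper from being confused with base R's `sum()` when reading the code.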