
I'm looking to create a frequency table in R for a variable (INTERVIEW_DAY), weighted by another variable (WEIGHT).

I've managed to do that with the data.table package, but I would love to do it with base R.

Below you'll find the type of table I want, but still unweighted; the weighting is what I'm looking to learn.

Data (variables TUCASEID, INTERVIEW_DAY, and WEIGHT):

TUCASEID INTERVIEW_DAY    WEIGHT
1  2.00301e+13             5 8155462.7
2  2.00301e+13             6 1735322.5
3  2.00301e+13             6 3830527.5
4  2.00301e+13             4 6622023.0
5  2.00301e+13             4 3068387.3
6  2.00301e+13             4 3455424.9
7  2.00301e+13             1 1637826.3
8  2.00301e+13             2 6574426.8
9  2.00301e+13             6 1528296.3
10 2.00301e+13             4 4277052.8
11 2.00301e+13             6 1961482.3
12 2.00301e+13             7  505227.2
13 2.00301e+13             6 2135476.8
14 2.00301e+13             3 5366309.3
15 2.00301e+13             6 1058351.1

Creating the table with the data.table package:

setDT(df)   # converts df to a data.table by reference; no reassignment needed
df_freq_table <- df[,.(Freq = .N), by = INTERVIEW_DAY][, Prop := Freq / sum(Freq)][, Cum := cumsum(100 * Prop / sum(Prop))]

My output, from df_freq_table[]:

 INTERVIEW_DAY Freq       Prop        Cum
1:             5    1 0.06666667   6.666667
2:             6    6 0.40000000  46.666667
3:             4    4 0.26666667  73.333333
4:             1    1 0.06666667  80.000000
5:             2    1 0.06666667  86.666667
6:             7    1 0.06666667  93.333333
7:             3    1 0.06666667 100.000000
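For reference, the same unweighted table can also be built in base R with table/prop.table; a minimal sketch (assuming df is the data.frame shown above, and keeping the order of first appearance as in the data.table output):

    day  <- factor(df$INTERVIEW_DAY, levels = unique(df$INTERVIEW_DAY))
    tbl  <- table(day)                    # unweighted counts per day
    Prop <- as.numeric(prop.table(tbl))   # proportions, summing to 1
    data.frame(INTERVIEW_DAY = names(tbl), Freq = as.numeric(tbl),
               Prop = Prop, Cum = cumsum(100 * Prop))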
Econ_Spectre
  • Hi Econ_Spectre, can you show the data before and after? I don't understand what data you are starting from and what you want it to look like. Thanks! – KamRa Jan 01 '20 at 19:00
  • My bad! I'll update the question! – Econ_Spectre Jan 01 '20 at 19:05
  • @Econ_Spectre Is your data.table output the expected one? – akrun Jan 01 '20 at 19:10
  • @akrun No, it is not. These values relate only to the main variable "INTERVIEW_DAY". I want a table that shows the frequency of the aforementioned variable, weighted by the variable "WEIGHT". Thank you. – Econ_Spectre Jan 01 '20 at 19:12
  • Can you please update with the expected output? How do you want to take the 'weight' into account for the frequency? – akrun Jan 01 '20 at 19:13
  • @akrun That was actually my question. I cannot create that output, which is why I need your help, guys. The values in my question are "pure values" of the variable "INTERVIEW_DAY". Now I need to weight them by the variable "WEIGHT". The kind of weight should be an average weight of "WEIGHT". That would suffice. Thank you. – Econ_Spectre Jan 01 '20 at 19:16
  • @Econ_Spectre Ok, can you check the update on my answer with `count`? – akrun Jan 01 '20 at 19:16
  • @akrun, thank you for such an effort. But you've changed the package, and I got another layout. Could we make a table just like the one in the question with your new package 'dplyr'? Moreover, how could I switch the frequency from absolute values to relative values (%)? Ty. – Econ_Spectre Jan 01 '20 at 19:24
  • I updated with the `dplyr` output – akrun Jan 01 '20 at 19:25

1 Answer


In base R, we can make use of xtabs/prop.table. Based on the OP's code, the cumsum is calculated in the order of first occurrence of the unique values in 'INTERVIEW_DAY'. So, to avoid sorting by the integer value, convert 'INTERVIEW_DAY' to a factor with its levels in that order, get the sum of 'WEIGHT' by 'INTERVIEW_DAY' with xtabs, use prop.table to get the proportions, and then apply cumsum to that output:

df$INTERVIEW_DAY <- factor(df$INTERVIEW_DAY, levels = unique(df$INTERVIEW_DAY))
tbl1 <- xtabs(WEIGHT ~ INTERVIEW_DAY, df)
Prop <- prop.table(tbl1)
Cum <- cumsum(100 * Prop / sum(Prop))
Cum
#        5         6         4         1         2         7         3 
# 15.71029  39.30705  72.86967  76.02470  88.68935  89.66260 100.00000 

out <- data.frame(INTERVIEW_DAY = names(tbl1), Freq = as.numeric(tbl1),
            Prop = as.numeric(Prop), Cum = as.numeric(Cum))
row.names(out) <- NULL
out
#  INTERVIEW_DAY       Freq        Prop       Cum
#1             5  8155462.7 0.157102906  15.71029
#2             6 12249456.5 0.235967631  39.30705
#3             4 17422888.0 0.335626124  72.86967
#4             1  1637826.3 0.031550297  76.02470
#5             2  6574426.8 0.126646592  88.68935
#6             7   505227.2 0.009732453  89.66260
#7             3  5366309.3 0.103373998 100.00000
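If the table should instead be sorted by the day value (1, 2, ..., 7) rather than by order of first appearance, skip the factor conversion, as noted in the comments below; a sketch, assuming 'INTERVIEW_DAY' is still numeric:

    # With INTERVIEW_DAY left numeric, xtabs orders the groups 1, 2, ..., 7
    tbl2  <- xtabs(WEIGHT ~ INTERVIEW_DAY, df)
    Prop2 <- as.numeric(prop.table(tbl2))
    data.frame(INTERVIEW_DAY = names(tbl2), Freq = as.numeric(tbl2),
               Prop = Prop2, Cum = cumsum(100 * Prop2))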

If we need a weighted frequency with dplyr, use `count` with the `wt` argument, which sums `WEIGHT` within each group instead of counting rows:

library(dplyr)
df %>% 
  mutate(INTERVIEW_DAY = factor(INTERVIEW_DAY, levels = unique(INTERVIEW_DAY))) %>%
  count(INTERVIEW_DAY, wt = WEIGHT, sort = FALSE) %>% 
  mutate(Prop = n / sum(n),
         Cum = cumsum(100 * Prop/sum(Prop)))
# A tibble: 7 x 4
#  INTERVIEW_DAY         n    Prop   Cum
#  <fct>             <dbl>   <dbl> <dbl>
#1 5              8155463. 0.157    15.7
#2 6             12249456. 0.236    39.3
#3 4             17422888  0.336    72.9
#4 1              1637826. 0.0316   76.0
#5 2              6574427. 0.127    88.7
#6 7               505227. 0.00973  89.7
#7 3              5366309. 0.103   100. 

Or with data.table:

library(data.table)
setDT(df)[, .(Freq = sum(WEIGHT)), by = INTERVIEW_DAY
  ][, Prop := Freq / sum(Freq)][, Cum := cumsum(100 * Prop / sum(Prop))][]
#  INTERVIEW_DAY       Freq        Prop       Cum
#1:             5  8155462.7 0.157102906  15.71029
#2:             6 12249456.5 0.235967631  39.30705
#3:             4 17422888.0 0.335626124  72.86967
#4:             1  1637826.3 0.031550297  76.02470
#5:             2  6574426.8 0.126646592  88.68935
#6:             7   505227.2 0.009732453  89.66260
#7:             3  5366309.3 0.103373998 100.00000

data

df <- structure(list(TUCASEID = c(2.00301e+13, 2.00301e+13, 2.00301e+13, 
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 
2.00301e+13, 2.00301e+13), INTERVIEW_DAY = c(5L, 6L, 6L, 4L, 
4L, 4L, 1L, 2L, 6L, 4L, 6L, 7L, 6L, 3L, 6L), WEIGHT = c(8155462.7, 
1735322.5, 3830527.5, 6622023, 3068387.3, 3455424.9, 1637826.3, 
6574426.8, 1528296.3, 4277052.8, 1961482.3, 505227.2, 2135476.8, 
5366309.3, 1058351.1)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))
akrun
  • Thank you for your time, @akrun. Could you help me create the table in the same layout as my question? It would also be great to get it weighted by the variable "WEIGHT". Would you help me out? – Econ_Spectre Jan 01 '20 at 19:14
  • @Econ_Spectre I updated in `base R`, `data.table` and `dplyr` – akrun Jan 01 '20 at 19:41
  • Thank you for the great job, @akrun. I've already accepted your answer, because it has nothing to improve. Do you have programming knowledge in Stata as well? – Econ_Spectre Jan 01 '20 at 19:43
  • @Econ_Spectre Thank you. No, I don't use Stata – akrun Jan 01 '20 at 19:44
  • Just another little question: do you know how to order this table according to the values of "INTERVIEW_DAY", in ascending order... 1, 2, 3, 4, etc., instead of as it is? Thank you a lot. – Econ_Spectre Jan 01 '20 at 21:16
  • @Econ_Spectre In that case, don't do any `factor` conversion, then it would be ordered in the numeric order – akrun Jan 01 '20 at 21:18
  • Perfect, I did it, and it worked like a charm. Thank you. If I may ask one last little question, because you seem to be very experienced with data analysis: looking at the column "Freq" above and comparing with the output 1 | 2,974.1424 14.35 14.35 2 | 3,065.6819 14.80 29.15 3 | 2,919.3688 14.09 43.24 4 | 2,916.17392 14.07 57.31 5 | 2,941.05299 14.19 71.51 6 | 2,962.0832 14.30 85.80 7 | 2,941.4968, could you tell why they differ? Ty! – Econ_Spectre Jan 01 '20 at 21:37
  • @Econ_Spectre Are you asking about the 'Freq' value output in my post. Not clear about your question. Or are you looking for a way to check whether there is any difference with a statistical test – akrun Jan 01 '20 at 21:42
  • I'm talking about your output. I have another one here, with exactly the same values, except that the "Freq" column is filled differently. The values differ completely. For example, instead of "8155462.7" for INTERVIEW_DAY "5" in the "Freq" column of your output, I have "2,941.05299". Is that a transformed value? The same value in another unit? Can you tell why that would differ? Thank you again. – Econ_Spectre Jan 01 '20 at 21:51
  • @Econ_Spectre Can you check the `str(df)` in your dataset. In my case, the WEIGHT is numeric – akrun Jan 01 '20 at 21:58
  • in mine it is numeric as well. Why? – Econ_Spectre Jan 01 '20 at 22:28
  • with a different dataset values would be different. here we are doing sum by group – akrun Jan 01 '20 at 22:30
  • @Econ_Spectre. It could be also possible that your dataset `INTERVIEW_DAY` values are floating point and have small difference in values resulting in each value being grouped in a different bin – akrun Jan 01 '20 at 22:34
  • Sorry, I communicated poorly what I'm talking about. Forget your output. The context is: I'm replicating an article from Stata in R. I took your code and applied it to my dataset. The values I got were identical to the values in the article, except for the column "Freq". The article code's output for INTERVIEW_DAY "5" was: 2,941.05299. My output with your code for INTERVIEW_DAY "5" was: 11672116808. My question to you was: do you have any idea why? Might that be a transformation of the number? Another unit? Sorry for the mistake, I'm pretty tired. – Econ_Spectre Jan 01 '20 at 23:07
  • @Econ_Spectre Can you check whether they were using similar way for calculation. It may be the `sum` as we used for weighted frequency. Could be `weighted.mean` or some other transformations – akrun Jan 01 '20 at 23:09
  • Ty @akrun. I know you said you don't use Stata, but the line for the frequency table was "tab interview_day [aw=weight], matcell(freq)". I guess the "aw" stands for "average weight", or weighted mean as you pointed out. I'm using these lines of yours: Table_WEIGHT <- xtabs(WEIGHT ~ INTERVIEW_DAY, timeuse_2003); Prop <- prop.table(Table_WEIGHT); Cum <- cumsum(100 * Prop / sum(Prop)); Cum; out <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT), Prop = as.numeric(Prop), Cum = as.numeric(Cum)); out – Econ_Spectre Jan 01 '20 at 23:22
  • @Econ_Spectre. Is it using analytic weights i.e. https://www.parisschoolofeconomics.eu/docs/dupraz-yannick/using-weights-in-stata(1).pdf or https://www.stata.com/help10.cgi?weight – akrun Jan 01 '20 at 23:28
  • Exactly. You got it. How to transform it into an "analytic weights" in R? – Econ_Spectre Jan 01 '20 at 23:54
  • @Econ_Spectre Can you check [here](https://rstudio-pubs-static.s3.amazonaws.com/279455_1ca98bc2ce5f45ed9ea20a647ef0c4b1.html) – akrun Jan 02 '20 at 00:01
  • Can I derive these commands from regressions to include them in the frequency table? The link you sent explains how to make regressions with the Stata aw. Thank you again! – Econ_Spectre Jan 02 '20 at 00:20
  • @Econ_Spectre Can you post as a new question. thanks – akrun Jan 02 '20 at 16:18
  • I'm off for holidays and will do it as soon as I get home! Thank you @akrun! – Econ_Spectre Jan 03 '20 at 13:39
  • @Econ_Spectre Happy holidays and belated Happy New Year! – akrun Jan 03 '20 at 17:11
  • For you too, mate! – Econ_Spectre Jan 03 '20 at 20:32
  • Hi @akrun. I'll post another question as you requested. Thank you again! – Econ_Spectre Jan 06 '20 at 20:18
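
Following up on the Stata thread above: `tab interview_day [aw=weight]` uses analytic weights, which Stata rescales so that the weights sum to the number of observations. That rescaling would explain frequencies in the low thousands instead of the millions. A minimal sketch of mimicking it in R (W_AW is a helper column introduced here; the rescaling rule is worth verifying against the Stata weight documentation):

    # Rescale WEIGHT so the weights sum to nrow(df), as Stata's [aw=] does
    df$W_AW <- df$WEIGHT / mean(df$WEIGHT)     # sum(df$W_AW) == nrow(df)
    tbl_aw  <- xtabs(W_AW ~ INTERVIEW_DAY, df)
    Prop_aw <- as.numeric(prop.table(tbl_aw))
    data.frame(INTERVIEW_DAY = names(tbl_aw), Freq = as.numeric(tbl_aw),
               Prop = Prop_aw, Cum = cumsum(100 * Prop_aw))

Note that rescaling by a constant changes only the Freq column; Prop and Cum are unaffected, which matches the report in the comments that only "Freq" differed from the article.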