0

How do I store the estimation sample from a fixest::feols regression so that I can calculate summary statistics? In Stata this can be done with e(sample), eg. sum y if e(sample) to calculate the mean of the dependent variable on the estimation sample.

From a previous question, I see that obs(model) can be used to store the estimation sample to run further regressions using subset, but I don't see how to use it to calculate summary statistics, because obs(model) returns integers instead of Booleans.

Macaulay
  • 59
  • 1
  • 8

2 Answers2

0

R is not Stata: It looks like this question is more a general question on how to report descriptive statistics in R.

There are many many ways to report desc stats. I provide an example below illustrating how to curate the desc stats to the estimation sample.

library(fixest)
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
# let's add a few missing values
base$x1[1:5] = NA
# estimation
est = feols(y ~ x1, base)
#> NOTE: 5 observations removed because of NA values (RHS: 5).

# There is an infinity of ways to get descriptive stats
# As an example, let's use the collapse pkg
library(collapse)
descr(base[obs(est), all.vars(est$fml)])
#> Dataset: all.vars(x$fml), 2 Variables, N = 145
#> ------------------------------------------------------- 
#> y (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     35  5.88  0.82  4.3  7.9  0.27  2.46
#> Quantiles
#>    1%    5%  10%  25%  50%  75%  90%   95%  99%
#>   4.4  4.62  4.9  5.2  5.8  6.4  6.9  7.28  7.7
#> -------------------------------------------------------
#> x1 (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     23  3.05  0.44    2  4.4  0.35  3.19
#> Quantiles
#>    1%    5%  10%  25%  50%  75%   90%  95%   99%
#>   2.2  2.32  2.5  2.8    3  3.3  3.66  3.8  4.16
#> ---------------------------------------------------------

Further, you can very easily automate that with a function, as below:

#Let's create a function that automates that
summ = function(x){
  # x: fixest estimation
  
  # we fetch the data
  data = model.matrix(x, type = c("lhs", "rhs"))
  # we remove the intercept
  var_keep = names(data) != "(Intercept)"
  data = data[var_keep]
  # the sumstat
  descr(data)
}

summ(est)
#> Dataset: data, 2 Variables, N = 145
#> ----------------------------------------------------- 
#> y (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     35  5.88  0.82  4.3  7.9  0.27  2.46
#> Quantiles
#>    1%    5%  10%  25%  50%  75%  90%   95%  99%
#>   4.4  4.62  4.9  5.2  5.8  6.4  6.9  7.28  7.7
#> ----------------------------------------------------- 
#> x1 (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     23  3.05  0.44    2  4.4  0.35  3.19
#> Quantiles
#>    1%    5%  10%  25%  50%  75%   90%  95%   99%
#>   2.2  2.32  2.5  2.8    3  3.3  3.66  3.8  4.16
#> ------------------------------------------------------

If you would prefer a table, you can easily adapt the code to work with datasummary:

library(modelsummary)
summ = function(x){
  # x: fixest estimation
  
  # we fetch the data
  data = model.matrix(x, type = c("lhs", "rhs"))
  # we remove the intercept
  var_keep = names(data) != "(Intercept)"
  data = data[var_keep]
  # the sumstat
  datasummary(All(data) ~ Mean + SD + Min + Median + Max, data, output = "dataframe")
}

summ(est)
#>      Mean   SD  Min Median  Max
#> 1  y 5.88 0.82 4.30   5.80 7.90
#> 2 x1 3.05 0.44 2.00   3.00 4.40
Laurent Bergé
  • 1,292
  • 6
  • 8
0

One solution is filtering on the indices of the dataframe:

df %>% filter(row_number() %in% obs(model))
   %>% summarize(y_mean = mean(y))
Macaulay
  • 59
  • 1
  • 8