3

Update: The fixest package now includes a function obs() that retrieves the indices of the observations used in a regression.

Original post: I would like to obtain the estimation sample from a model object, i.e. the observations that were not dropped due to missing values. This seems to be simple for standard lm regressions (using case.names()) but less so for more recent packages such as fixest.

Is there any general way to access the estimation sample, irrespective of the package used for estimation?

My attempts for both lm and fixest objects are:

library(tidyverse)
library(insight)
library(fixest)

# create data with NA -----------------------------------------------------

dat <- mtcars %>% 
  as_tibble(rownames = "model") %>% 
  mutate(cyl = na_if(cyl, 4))


# lm ----------------------------------------------------------------------

mod_lm <- lm(mpg ~ cyl * disp, data = dat)
obs <- as.integer(case.names(mod_lm))

dat %>% 
  filter(row_number() %in% obs)
#> # A tibble: 21 x 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8 Merc 280C    17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # … with 11 more rows


# fixest ------------------------------------------------------------------

mod_fe <- fixest::feols(mpg ~ cyl * disp, data = dat)
#> NOTE: 11 observations removed because of NA values (RHS: 11).

# does not work
case.names(mod_fe)
#> NULL

# remove missing values manually for all variables used in the regression
vars <- find_predictors(mod_fe, flatten = TRUE)

dat %>% 
  filter(if_all(
    all_of(vars),
    ~ !is.na(.x)
  ))
#> # A tibble: 21 x 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8 Merc 280C    17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # … with 11 more rows

Created on 2021-06-09 by the reprex package (v2.0.0)

dufei
  • 2,166
  • 1
  • 7
  • 18

2 Answers2

1

Generic function case.names has no method written for objects of class "fixest". The solution is to look at str(mod_fe) and write your own method.

case.names.fixest <- function(object, ...){
  no <- object$obsRemoved
  seq_len(object$nobs_origin)[-no]
}

case.names(mod_fe)
# [1]  1  2  4  5  6  7 10 11 12 13 14 15 16 17 22 23 24 25 29 30 31
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • So in other words there could not possibly be a function that extracts the estimation sample irrespective of the model class because each object class could be structured arbitrarily. And to make a generic such as case.names work for a given model class, I'd either have to define it in my own environment or ask the developer of the class to include the method in his package? – dufei Jun 11 '21 at 08:42
  • 1
    @dufei Yes, that's about it. `case.names` is in base package `stats` and the R Core Team cannot guess how each package developer will structure their modeling functions' return values. It's up to them to write methods. If you contact `maintainer("fixest")` you (and all others) might get it in the next package version. – Rui Barradas Jun 11 '21 at 09:25
0

This answer is specifically for the fixest package. using mod_fe$fixef_id can extract the estimation sample. However, the identifier are saved as attribution. To extract them, save attr(mod_fe$fixef_id,"fixef_names") as a stand-alone variable. The developers probably have added this feature.

Roshin Raphel
  • 2,612
  • 4
  • 22
  • 40
Chang
  • 1
  • 1
  • No. As @RuiBarradas states explicitly, the accepted answer creates a function to provide the requested information. The function is *not* part of the `fixest` package. – Limey Mar 09 '23 at 13:08