regression lm on R qualitative and quantitative variables

Question

I am trying to do a linear regression on R, with qualitative and quantitative variables. Here is the structure of dataset :

trip                             X                 year

A 10                          500                  2019
B 11                          600                  2019
C 9                           450                  2020

I want to do a regression of "A" (y, variable to explain) on the other variables (explicative variables). The problem being that with a qualitative variable like "spécialité", it then treats other variables like qualitative variables to, and thus does a regression separately on year 2019, year 2020,...

lm_amb <- lm(reg_spé_amb[,3] ~ reg_spé_amb[,2] + reg_spé_amb[,1] + reg_spé_amb[,4])
summary(lm_amb)

I thus obtain a result in the following form :

                                                 Estimate            std error
                                                          
intercept                                         400                   26
reg_spé_amb[, 2]10                                 88                   66
reg_spé_amb[, 2]11                                 64                   10
reg_spé_amb[, 1]A                                  70                   80
reg_spé_amb[, 4]2019                               80                   90

I would like to get a coefficient per A, one for the variable "year" as a whole and not as separate variables for each year, and one for the ("A"). Could anyone help me with that ?

dipetkov · Answer 1 · 2022-06-25T19:05:31.643

Your question suggests you are just starting with both statistics and R. You'll benefit from first learning the basics and there are many resources available online. Take a look for example at Modern Statistics with R.

With that advice out of the way, let's look at your questions.

You provided just three rows of data which is not enough to estimate the parameters of the proposed regression. So I add three rows of fake data.

library("tidyverse")

reg_spé_amb <-
  tibble::tribble(
    ~spécialité, ~`nombre de praticiens`, ~`activité en valeur`, ~année,
    "anesthésie", 10L, 500L, 2019L,
    "chirurgie", 11L, 600L, 2019L,
    "anesthésie", 9L, 450L, 2020L,
    "chirurgie", 9L, 550L, 2019L,
    "chirurgie", 10L, 450L, 2019L,
    "anesthésie", 10L, 650L, 2020L
  )

It's convenient to fit a model to the data using the formula interface. R keeps track of lots of information for us, including the variable names, so there is no need for awkward specifications such as reg_spé_amb[,1].

lm_amb <- lm(
  formula = `activité en valeur` ~ `nombre de praticiens` + spécialité + année,
  data = reg_spé_amb
)

After we fit the model, we validate it. This step is crucial because what's the point in extracting coefficients from a model that's a poor fit to the data? See 8.1.4 Model diagnostics in "Modern Statistics with R".

[Lots of work to validate the model...]

Once we are satisfied with the model, we look at the coefficients and various other statistics.

Here is how we can get the coefficient for the linear effect of year (année) and number of doctors (nombre de praticiens) on the response (activité en valeur).

summary(lm_amb)
#> 
#> Call:
#> lm(formula = `activité en valeur` ~ `nombre de praticiens` + 
#>     spécialité + année, data = reg_spé_amb)
#> 
#> Residuals:
#>          1          2          3          4          5          6 
#> -3.197e-14  6.667e+00 -7.000e+01  7.667e+01 -8.333e+01  7.000e+01 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)
#> (Intercept)            -161620.00  272131.90  -0.594    0.613
#> `nombre de praticiens`      60.00      67.33   0.891    0.467
#> spécialitéchirurgie         33.33     122.93   0.271    0.812
#> année                       80.00     134.66   0.594    0.613
#> 
#> Residual standard error: 106.5 on 2 degrees of freedom
#> Multiple R-squared:   0.32,  Adjusted R-squared:   -0.7 
#> F-statistic: 0.3137 on 3 and 2 DF,  p-value: 0.819

Or more succinctly and with better table formatting:

broom::tidy(lm_amb)
#> # A tibble: 4 × 5
#>   term                    estimate std.error statistic p.value
#>   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)            -161620.   272132.     -0.594   0.613
#> 2 `nombre de praticiens`      60        67.3     0.891   0.467
#> 3 spécialitéchirurgie         33.3     123.      0.271   0.812
#> 4 année                       80.0     135.      0.594   0.613

It's just as easy to get one coefficient (marginal effect) for each level of the categorical variable specialty (spécialité).

emmeans::emmeans(lm_amb, ~ spécialité)
#>  spécialité emmean   SE df lower.CL upper.CL
#>  anesthésie    530 65.4  2      248      812
#>  chirurgie     563 89.8  2      177      950
#> 
#> Results are averaged over the levels of: année 
#> Confidence level used: 0.95

As a bonus, here is how to compare the marginal effects of the two specialties. Note that comparison anesthésie - chirurgie is what we get from the summary table above as the coefficient spécialitéchirurgie. The sign is flipped because spécialitéchirurgie is actually the comparison chirurgie - anesthésie.

emmeans::emmeans(lm_amb, pairwise ~ spécialité)
#> $contrasts
#>  contrast               estimate  SE df t.ratio p.value
#>  anesthésie - chirurgie    -33.3 123  2  -0.271  0.8117
#> 
#> Results are averaged over the levels of: année

Understanding all of the above will help you to understand how to fit and interpret a regression model.

^{Created on 2022-06-25 by the reprex package (v2.0.1)}

this doesn't work, because i think this isn't the right model for what i am searching for, i get the same thing (aka it treats the number of doctors as quali variables, thus giving a coefficient for the variable "12 doctors", one for "11 doctors" etc. Here is a reformulation of my question using a fixed effects model : https://stackoverflow.com/questions/72743056/fixed-effects-multivariate-model-r — user19304348, Jun 27 '22 at 08:03
This is the model you suggested yourself. I just wanted to point out there is an easier way to fit and diagnose models in R. In my experience, it's hard to do good analysis before knowing how to use one's statistical tool of choice, R in your case. Jumping into fitting complex models before knowing the basics is not necessarily a great idea either. — dipetkov, Jun 27 '22 at 08:04
Another hint: You can check the data type of each column with `str(reg_spé_amb)`. Make sure that the `nombre de praticiens` column is numeric. — dipetkov, Jun 27 '22 at 08:19
i still have the same pb, and i think its because of the model i designed. Thanks for the tip, ill be trying to use a fixed effects model as i think it is better suited for what im searching for — user19304348, Jun 27 '22 at 10:13

regression lm on R qualitative and quantitative variables

1 Answers1