
I want to estimate the effect of height on earnings separately for each gender. I split my data into male and female subsets, but when I run lm(earnings~height+education+age, data = data_female) I get this error: Error in model.frame.default(formula = earnings ~ height + education + : variable lengths differ (found for 'education')

Would you be able to help in either suggesting a better way to refine my model or helping to fix this particular error? Please let me know.

setwd("~/Google Drive/R Data")
data <- read.csv('data_ass5.csv')
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
lm(earnings~height+age+gender+education,data = data)
summary(multiple_regression)
summary(linear_regression)
multiple_regression_redefined <- lm(earnings~age+gender+education,data = data)

# Now I wish to assess the impact of gender on earnings in particular,
# so I am trying to refine my model as follows,
# but the last lm line causes an error. Could you advise whether
# this is the correct way to refine it and/or why I am getting the error?
# I even tried adding na.rm=TRUE to the lm call, but the error persists.

data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
lm(earnings~height+education+age, data = data_female)

smn
  • Hello SMN, welcome to Stack Overflow. Please read [How to ask and answer homework questions](https://meta.stackoverflow.com/questions/334822/how-do-i-ask-and-answer-homework-questions) and update your question accordingly. – Len Greski Mar 17 '20 at 00:40
  • Delete all of these and similar commands: `height <- data$height` – Edward Mar 17 '20 at 00:51
  • I have edited the way in which I am asking my question --- thank you for directing me to that page. Would you be able to suggest why I am going wrong now and where? – smn Mar 17 '20 at 19:01

1 Answer


Per the docs for lm, the data argument resolves the variables in the formula in two ways that are NOT mutually exclusive:

data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
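The lookup rule quoted above can be seen in a small sketch. The data frame toy and its values are invented for illustration; only the column names mirror the question.

```r
# Toy data: column names mirror the question, values are invented
toy <- data.frame(
  earnings = c(10, 12, 15, 11, 18, 20, 14, 17),
  height   = c(60, 62, 65, 61, 68, 70, 64, 66),
  educ     = c(12, 12, 16, 14, 16, 18, 12, 16)
)

# A free-standing vector whose name is NOT a column of toy
education <- toy$educ

# 'education' is not found in toy, so lm() takes it from the calling environment
fit_env <- lm(earnings ~ height + education, data = toy)

# 'educ' is a column of toy, so it is taken from the data frame
fit_col <- lm(earnings ~ height + educ, data = toy)

# Same numbers either way; only the lookup path differs
same_fit <- isTRUE(all.equal(unname(coef(fit_env)), unname(coef(fit_col))))
print(same_fit)
```

Both fits agree here only because the global vector and the column happen to hold the same values and have the same length.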

Specifically, all of your vector assignments are redundant, and their names match column names in the data frame, except for gender and education:

height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ

multiple_regression <- lm(earnings~height+age+gender+education,data = data)

When the above is run, all referenced names except gender and education are taken from the data frame. But gender and education are pulled from the global environment, from the vectors you assigned above. Had you used sex and educ, the values would be pulled from the data frame like all the others.

Relatedly, your subset calls use the gender vector rather than the sex column. Fortunately, the two are identical, so no errors or undesired results occurred.

data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)

Therefore, once you have subset your data, lm pulls every variable from the subset data frame except one, education, which comes from the global environment. But remember that education was built from the full data frame, so it is longer than the columns of the subset data frame, which is what triggers the "variable lengths differ" error.
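A minimal sketch that should reproduce the mismatch, assuming an invented data frame toy whose column names follow the question:

```r
# Toy data: column names mirror the question, values are invented
toy <- data.frame(
  earnings = c(10, 12, 15, 11, 18, 20, 14, 17),
  height   = c(60, 62, 65, 61, 68, 70, 64, 66),
  educ     = c(12, 12, 16, 14, 16, 18, 12, 16),
  sex      = c(0, 1, 0, 1, 0, 1, 0, 1)
)

education  <- toy$educ               # length 8, built from the full data
toy_female <- subset(toy, sex == 0)  # only 4 rows

# earnings and height come from toy_female (4 values each),
# education comes from the global environment (8 values)
err_msg <- tryCatch(
  {
    lm(earnings ~ height + education, data = toy_female)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)
```

Replacing education with educ in the formula makes every variable come from toy_female, so the lengths agree.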


Altogether, simply avoid assigning the redundant vectors and use the column names for both the full and the subset data frames.

# REMOVE THESE REDUNDANT ASSIGNMENTS
# height <- data$height
# earnings <- data$earnings
# gender <- data$sex
# age <- data$age
# education <- data$educ

# REPLACE gender WITH sex AND education WITH educ (RENAME COLS IF NEEDED)
multiple_regression <- lm(earnings ~ height + age + sex + educ, data = data)

# REPLACE gender WITH sex
data_female <- subset(data, sex==0)
data_male <- subset(data, sex==1)

# REPLACE education WITH educ
lm(earnings ~ height + educ + age, data = data_female)
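As a possible alternative to creating separate data frames, lm also accepts a subset argument that is evaluated inside data. The sketch below uses an invented data frame toy (not the asker's file) to check that both routes give the same coefficients.

```r
# Toy data: column names mirror the question, values are invented
toy <- data.frame(
  earnings = c(10, 12, 15, 11, 18, 20, 14, 17, 13, 19, 16, 21),
  height   = c(60, 62, 65, 61, 68, 70, 64, 66, 63, 69, 67, 71),
  educ     = c(12, 12, 16, 14, 16, 18, 12, 16, 14, 18, 12, 16),
  age      = c(25, 40, 33, 51, 29, 45, 38, 27, 60, 35, 42, 31),
  sex      = rep(c(0, 1), 6)
)

# Route 1: subset the data frame first, then fit
fit_split <- lm(earnings ~ height + educ + age, data = subset(toy, sex == 0))

# Route 2: keep the full data frame and let lm() select the rows
fit_arg <- lm(earnings ~ height + educ + age, data = toy, subset = sex == 0)

# Both routes use the same rows, so the coefficients should match
same_coef <- isTRUE(all.equal(coef(fit_split), coef(fit_arg)))
print(same_coef)
```

Either way, the formula refers only to column names, so no variable is picked up from the global environment.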
Parfait