How can I ignore the NA data when I do the lm function?

Question

My question is rather simple, but I could not get it resolved after trying a lot of things.

I have two data frames.

> a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1

> b
  col1 col2 col3 col4
1    5    2    1    4
2    2   NA    2    3
3    3   NA    3    2
4    4    3    4    1

Can I use lm(a ~ b) to fit the data in a and b?

If so, how do I ignore the NA values?

Thanks, Dan

score 4 · Answer 1 · edited Nov 23 '10 at 21:17

Generally, the regression functions in R report results from complete cases only, so you usually do not need to do anything special to hold out cases with missing values. Your question is a bit vague, though: it is not clear why you are putting an entire matrix (or is that a data.frame?) on the left-hand side of a formula. lm() is capable of multivariate analyses, but people who want to do that will generally ask more specific questions.

> lm(a$col1 ~ b$col1+b$col2 +b$col3+b$col4)

Call:
lm(formula = a$col1 ~ b$col1 + b$col2 + b$col3 + b$col4)

Coefficients:
(Intercept)       b$col1       b$col2       b$col3       b$col4  
         16           -3           NA           NA           NA

Two of the four cases are dropped because of the NAs in b$col2, and the two that remain are only enough to estimate the intercept and one slope, so the other coefficients come back as NA.
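
To check the complete-case behaviour on the question's data, a sketch along these lines may help. It rebuilds a and b by hand (values copied from the question) and uses complete.cases() and nobs(), which the answer itself does not mention, to count the rows that survive.

```r
# Rebuild the two example data frames from the question.
a <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, NA, 2, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))
b <- data.frame(col1 = c(5, 2, 3, 4), col2 = c(2, NA, NA, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))

# Rows in which the response and every predictor are non-missing.
keep <- complete.cases(a$col1, b)
sum(keep)        # number of complete cases

# Same model as in the answer; lm() drops the incomplete rows itself.
fit <- lm(a$col1 ~ b$col1 + b$col2 + b$col3 + b$col4)
nobs(fit)        # rows actually used in the fit
coef(fit)        # coefficients that cannot be estimated are NA
```

Comparing sum(keep) with nobs(fit) is a quick way to confirm that the rows lm() used are exactly the complete ones.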

edited Nov 23 '10 at 21:17

Alex Brown

answered Nov 23 '10 at 18:38

IRTFM

Actually my data set is very big; I am just giving an example. Data a and b are both data frames, with the columns representing latitudes and the rows longitudes. – didimichael Nov 23 '10 at 18:43
Can you give us more information about the problem you are trying to solve? Are you trying to regress *each* column in a on *all* the columns in b? Or each column in a on each column in b? (If you want *all* columns in a on *all* columns in b, then as DWin says above you are really looking at a multivariate analysis ...) – Ben Bolker Nov 23 '10 at 20:27
It sounds from your data structure that you need to look at spatial statistics approaches. See the CRAN Spatial Stats Task View. http://finzi.psych.upenn.edu/views/Spatial.html . That will give you a better map of the regression techniques available for data that is spatially correlated. – IRTFM Nov 23 '10 at 21:13
It seems that na.actions don't operate on dependent variables. Am I missing a convenient function or approach? – Todd D Sep 29 '17 at 01:01
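
On the last comment: as far as I know, the na.action is applied to the whole model frame, response included, so rows with NA in the dependent variable are dropped as well. A minimal sketch with made-up data (d, y and x are hypothetical names, not from the question):

```r
# Hypothetical toy data: one NA in the response (y), one in the predictor (x).
d <- data.frame(y = c(1.2, NA, 2.9, 4.1, 5.2),
                x = c(1,   2,  3,   NA,  5))

# na.omit is the usual default; it is spelled out here to be explicit.
fit <- lm(y ~ x, data = d, na.action = na.omit)

nobs(fit)        # rows used in the fit
na.action(fit)   # indices of the rows that were dropped
```

Here row 2 is dropped because of the missing response and row 4 because of the missing predictor.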

Spacedman · Accepted Answer · 2010-11-24T10:51:38.840

If a and b are data frames, and you want to regress the individual values in a on the corresponding values in b, then you need to convert them to vectors, e.g.:

> lm(as.vector(as.matrix(a))~as.vector(as.matrix(b)))

Call:
lm(formula = as.vector(as.matrix(a)) ~ as.vector(as.matrix(b)))

Coefficients:
            (Intercept)  as.vector(as.matrix(b))  
               8.418239                -0.005241

Missing data are dropped by default; see help(lm) and the na.action argument. The summary method for an lm object reports how many observations were dropped.
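
A rough sketch of what the na.action argument changes, using made-up data (d, y and x are hypothetical). na.omit, na.exclude and na.fail are the standard handlers in the stats package; the fitted coefficients are the same for the first two, and only the shape of the residuals differs.

```r
# Hypothetical toy data with one missing predictor value.
d <- data.frame(y = c(2.1, 3.9, 6.2, 8.1, 9.8),
                x = c(1,   2,   NA,  4,   5))

fit_omit    <- lm(y ~ x, data = d, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit_omit))      # one residual per row used
length(residuals(fit_exclude))   # one per original row, NA where dropped

# The printed summary notes the observations deleted due to missingness.
summary(fit_omit)

# na.fail makes lm() stop with an error instead of silently dropping rows.
res <- try(lm(y ~ x, data = d, na.action = na.fail), silent = TRUE)
inherits(res, "try-error")
```

na.exclude is handy when the residuals need to line up with the original rows, for example before mapping them.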

Of course, ignoring the spatial correlation likely to be present in spatial data will mean your inferences from the parameter estimates will be quite wrong. Map the residuals. And read a good book on spatial stats...

[Edit: oh, and the data frames have to be all numbers or the whole lot gets converted to characters and then... well, who knows...]

Edit:

Another way of getting vectors from data frames is just to use 'unlist':

> a=data.frame(matrix(runif(16),4,4))
> b=data.frame(matrix(runif(16),4,4))
> lm(a~b)
Error in model.frame.default(formula = a ~ b, drop.unused.levels = TRUE) : 
  invalid type (list) for variable 'a'
> lm(unlist(a)~unlist(b))

Call:
lm(formula = unlist(a) ~ unlist(b))

Coefficients:
(Intercept)    unlist(b)  
     0.6488      -0.3137
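
For the data frames in the question (rebuilt by hand below, so treat the values as copied rather than authoritative), a sketch like this suggests that both flattening approaches stack the cells column by column, and that any cell pair with an NA on either side is dropped from the fit:

```r
# The question's data frames, retyped.
a <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, NA, 2, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))
b <- data.frame(col1 = c(5, 2, 3, 4), col2 = c(2, NA, NA, 3),
                col3 = c(1, 2, 3, 4), col4 = c(4, 3, 2, 1))

# Flatten column by column; use.names = FALSE skips the generated names.
ya <- unlist(a, use.names = FALSE)
xb <- unlist(b, use.names = FALSE)

# Both approaches give the same cell ordering for all-numeric data frames.
identical(ya, as.vector(as.matrix(a)))

fit <- lm(ya ~ xb)
nobs(fit)   # 16 cells minus the pairs where either value is NA
```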

I've not seen data.matrix before, thx Gavin.

Re your edit, Spacedman - `data.matrix()` would be a more natural alternative to `as.matrix()` in the above - at least it will handle the Logical | Factor -> Numeric encoding. Can't do anything about true character data in a data frame however... – Gavin Simpson Nov 24 '10 at 09:26
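
A small illustration of the difference Gavin describes, on a made-up data frame (d and its columns are hypothetical):

```r
# Hypothetical data frame mixing numeric, logical and factor columns.
d <- data.frame(x    = c(1.5, 2.5, 3.5),
                flag = c(TRUE, FALSE, TRUE),
                grp  = factor(c("lo", "hi", "lo")))

m_char <- as.matrix(d)     # the factor column forces a character matrix
m_num  <- data.matrix(d)   # logicals become 0/1, factors their integer codes

typeof(m_char)
typeof(m_num)
m_num
```

Note that the factor codes follow the level order (alphabetical by default), so they are only meaningful as numbers if that ordering is.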
