-1

Hi I have a huge dataframe (df) whose names are different Tenors, in the columns I have values for each tenor. For the last two tenors I have some missing rows which I want to complete based on the given data for the non-missing rows. My dataframe looks like this:

  1095          1825        2555        3650        5475        7300        10950
  0.00116034    0.00170552  0.00274189  0.00472176  0.00697495  NA          NA
  0.00112157    0.00188056  0.00295159  0.0050669   0.00728063  0.00816778  0.00842034
  0.00138009    0.00225073  0.00339548  0.00549386  0.00780401  0.00871812  0.00897222

I am stuck in using predict() and lm. I want to obtain those missing values. Sorry for this basic question, but I am in a hurry, and I have been stuck for over an hour.

Thanks in advance.

EDIT I want to create a linear model with a data frame, lets say df2

   df2 <-df[rowSums(is.na(df)) > 0,])

And use predict to find the missing values for 7300, 10950.

EDIT2:

Thanks to @Zheyuan Li I've gone through some progress, but I can't get my predicted data, I have tried to use two options:

b<-setNames(stack(df2),c("value", "Tenor"))
data.lm <- lm(value~Tenor, data = b, na.action = na.exclude)
pred<-predict(data.lm)

If I execute this code, I get the pred with the same values as b.

In the other hand, if I use the following code, I obtain the same values for all predicted values.

aov <- aov(data.lm,data=b)
pred<-predict(aov)

EDIT3:

I have adapted my code and removed the last column to make things easier. Now I have the following data:

  1095          1825        2555        3650        5475        7300        
  0.00116034    0.00170552  0.00274189  0.00472176  0.00697495  NA          
  0.00112157    0.00188056  0.00295159  0.0050669   0.00728063  0.00816778  
  0.00138009    0.00225073  0.00339548  0.00549386  0.00780401  0.00871812  

My new code looks like this:

setDT(df)


variables<-setdiff(names(df),c('7300',"DATE"))

y_var<-"7300"
Line<-function (train_dat, test_dat, variables, y_var, family = "gaussian") 
{
  fm <- as.formula(paste(y_var, " ~", paste(variables, collapse = "+")))
  glm1 <- glm(fm, data = train_dat, family = family)
  pred <- predict(glm1, newdata = test_dat)
  return(pred)
}

df[is.na(`7300`),`7300`:=
         Line(train_dat=df[!is.na(`7300`),],
              test_dat=df[is.na(`7300`)],
              variables,
              y_var)
       ]

Now I get the following error:

Error in terms.formula(formula, data = data) : 
  invalid term in model formula

Do you know how to solve it?

arodrisa
  • 300
  • 4
  • 17
  • If you want to impute it based on the `mean`, `median` etc of a column, check `na.aggregate` from `zoo` – akrun Oct 06 '16 at 14:27
  • I want to create the linear model based on the rows with complete data. I dont want only to extrapolate based on the names of the dataframe, but use the same movements for the known rows. Is it clear? – arodrisa Oct 06 '16 at 14:31
  • well, you should work on the transposed of your data.frame... – agenis Oct 06 '16 at 14:32
  • Not the mean. I want to create a linear model with a data frame, lets say df2 (df2 <-df[rowSums(is.na(df)) > 0,]). And use predict to find the missing values for 7300, 10950. Is is more clear? – arodrisa Oct 06 '16 at 14:38
  • maybe look at `VIM::irmi` it is an iterative imputation based on regressing each column at a time. – agenis Oct 06 '16 at 14:40
  • Yes I do, and how can I create the ANOVA model using all the dataframe? – arodrisa Oct 06 '16 at 14:58
  • Yes, what I mean is that I don't know how to use aov (I am a newby in R, and my background in programming is assembler), I don't know how to make the formula specifying the model. I have tried all the things are on my mind using df2 to create the model. – arodrisa Oct 06 '16 at 15:10
  • 1
    @ZheyuanLi, thanks for link and tips. I am getting closer. I will edit my question with the advances I am making – arodrisa Oct 06 '16 at 15:28
  • @arodrisa the issue is that the formula is recognizing the numbers as integers and not strings, I don't know if you are keen on doing this, but you might need to change the names of your columns using `setnames` – Jason Oct 07 '16 at 15:25
  • I thought about it, but I discarded it as I obtained that they where strings when using str(names(df)) – arodrisa Oct 08 '16 at 06:52

1 Answers1

0

I'm not certain this is what you are looking for, but here is how you might impute some missing data using a sample data set

data(mtcars) #pulling in some data to help out

# setting up missing data -------------------------------------------------
set.seed(1)
setDT(mtcars) #i like to work in data.table
sa<-sample(2,nrow(mtcars),prob=c(.1,.9),replace=T)
mtcars[,mpg_na:=mpg]
mtcars[sa==1,mpg_na:=NA]

now we have some missing data!

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb mpg_na
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   21.0
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   21.0
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   22.8
 4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     NA
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   18.7 
 6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   18.1 
 7: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     NA 

# function to predict using linear model ----------------------------------------

Line<-function (train_dat, test_dat, variables, y_var, family = "gaussian") 
{
  fm <- as.formula(paste(y_var, " ~", paste(variables, collapse = "+")))
  glm1 <- glm(fm, data = train_dat, family = family)
  pred <- predict(glm1, newdata = test_dat)
  return(pred)
}

variables<-setdiff(names(mtcars),c('mpg','mpg_na'))
y_var<-'mpg'

mtcars[!is.na(mpg_na),mpg_pred:=mpg] #setting up 
mtcars[is.na(mpg_na),mpg_pred:=
         Line(train_dat=mtcars[!is.na(mpg_na),],
              test_dat=mtcars[is.na(mpg_na)],
              variables,
              y_var)
       ]

How did we do???

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb mpg_na mpg_pred
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   21.0 21.00000
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   21.0 21.00000
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   22.8 22.80000
 4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     NA 21.48683
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   18.7 18.70000
 6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   18.1 18.10000
 7: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     NA 16.91128

There are other observations to look at but the two shown here (row 4 and 7) look reasonably close

Jason
  • 1,559
  • 1
  • 9
  • 14
  • thanks for the piece of code, I have made my code similar, but still have an error. I have edited the question – arodrisa Oct 07 '16 at 10:29