0

I am trying to run a multiple regression analysis to find the impact of water quality on plankton abundance in a specific location (aka Guzzler). I was able to get my model to run and the summary, however the data is non-parametric so a typical summary would not be reliable. This is mainly due to having a small sample size as it was done over the course of a few weeks and one sample each week.

I was then thinking the non-parametric version of this could be a bootstrap. I've run bootstraps on other data before but never a multiple regression model. I can't seem to find code on how to go about this so began with how I've performed bootstraps in the past. I was curious what I would need to edit in order to get this bootstrap to run.

Here is the output from dput(head(Guzzler1):

    structure(list(Abundance = c(98L, 43L, 65L, 55L, 54L), Phospates = c(2L, 
2L, 2L, 2L, 2L), Nitrates = c(0, 0.3, 0, 0.15, 0), pH = c(7.5, 
8, 7.5, 7, 7)), .Names = c("Abundance", "Phospates", "Nitrates", 
"pH"), row.names = c(NA, 5L), class = "data.frame")

Here is my model & the summary:

    Guzzler1model<-lm(Abundance ~ Phospates + Nitrates + pH, data=Guzzler1)
> summary(Guzzler1model)

Call:
lm(formula = Abundance ~ Phospates + Nitrates + pH, data = Guzzler1)

Residuals:
     1      2      3      4      5 
 20.75  -4.25 -12.25   8.50 -12.75 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -80.25     209.62  -0.383    0.739
Phospates         NA         NA      NA       NA
Nitrates     -135.00      90.02  -1.500    0.272
pH             21.00      28.87   0.727    0.543

Residual standard error: 20.41 on 2 degrees of freedom
Multiple R-squared:  0.5302,    Adjusted R-squared:  0.06032 
F-statistic: 1.128 on 2 and 2 DF,  p-value: 0.4698

***Please note: I believe phosphates have NA because each value was equal to 2 in this particular location.

Here is how I was originally performing a bootstrap and unsure what to change:

n=length(Guzzler1Abundance)
B = 1000
results = numeric(B)
for(b in 1:B){
  i = sample(x = 1:n, size=n, replace=TRUE)
  bootSample = Guzzler1Abundance[i]
  thetahat= mean(bootSample)
  results[b] = thetahat
}

Thank you so much in advance!

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • I think you need to include a larger data set to support a bootstrap demonstration. 5 cases with a four-component model is likely to error out. – IRTFM Mar 02 '19 at 19:49
  • @42- unfortunately I can't include any more data since this is for a project for some students. Do you have a suggestion for a non-parametric version of the regression model I ran? – Victoria Assad Mar 02 '19 at 20:48
  • This is a godd starting point: https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/Appendix-Bootstrapping.pdf – Marco Sandri Mar 02 '19 at 22:24

1 Answers1

0

I am not quite shure what you mean by non-parametric data but I understand, you want to take bootstrapped samples from your data and perform linear regression with it.

A possible way to do that would be

Guzzler1 <-  structure(list(Abundance = c(98L, 43L, 65L, 55L, 54L), Phospates = c(2L, 
                      2L, 2L, 2L, 2L), Nitrates = c(0, 0.3, 0, 0.15, 0), pH = c(7.5, 
                      8, 7.5, 7, 7)), .Names = c("Abundance", "Phospates", "Nitrates", 
                      "pH"), row.names = c(NA, 5L), class = "data.frame")
lines <- nrow(Guzzler1)
replicate(5, lm(Abundance ~ Phospates + Nitrates + pH, 
                         data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)

This will report the coefficents from 5 linear regressions like this

> replicate(5, lm(Abundance ~ Phospates + Nitrates + pH, 
+                          data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
                 [,1]       [,2]      [,3]       [,4] [,5]
(Intercept)  65.00000 145.000000 -408.0000 145.000000 -100
Phospates          NA         NA        NA         NA   NA
Nitrates    -73.33333   6.666667 -256.6667   6.666667 -110
pH                 NA -13.000000   66.0000 -13.000000   22

The number of 5 replicates can be chosen arbitrarily higher by changing the first argument of my replicate call. The many NA values are due to the scarce data, as @IRTFM predicted and explained. This will improve as more data is going to be sampled.

Let's sample 5000 bootstrap samples and investigate the distribution of the Nitrate coefficients:

reps <- replicate(5000, lm(Abundance ~ Phospates + Nitrates + pH, 
                           data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
plot(table(reps["Nitrates",]))
plot(ecdf(reps["Nitrates",]))
quantile(reps["Nitrates",], c(.025, .25, .5, .75, .975), na.rm = TRUE)

Phosphates can be edited to this, one there are variant data:

boxplot(reps["(Intercept)",], reps["Nitrates",], reps["pH",]
        , names = c("Intercept", "Nitrates", "pH"), ylab="bootstrapped coefficients")
abline(h=0, col="firebrick", lty=3)

enter image description here

Bernhard
  • 4,272
  • 1
  • 13
  • 23