2

I have a dataset of fish abundance data on which i want to perform a regression analysis. However, i want to perform a lot of regressions on different subsets of the data ,without having to do this manually, and save the coefs and P value in a new data frame.

Data is structured as follows (example):

site year species abund
a    2011  a      3
a    2016  b      5
b    2011  a      4
b    2015  a      9
a    2018  b      1
c    2010  a      2
b    2016  c      3
c    2012  a      1

In total i have 883 rows, 21 unique sites, 41 unique species and 8 different years.

I want a regression model for every species-site combination. (every combination has at least 5 observations) Model looks like:

lm(abund ~ year)  

but one model for every species for every site. So one model for species a in site a, one for species b in site a, one for species a in site b etc.

There are several topics on this on stack, but non seem to fit my needs. My idea was to use a for loop, but i can't get this to work properly. Have been tweeking all day, but can't get it to work.

slopes <- numeric(nrow(df))

for (i in 1:nrow(df)) {
  y <- as.numeric(df[i,4]) # row 4 is the abundancy data
  x <- df([i, 1]) # row 1 is the year data
  slopes[i] <- coef(lm(y ~ x))[2]
}

My question: How can i conduct linear regression models for all unique site-species combinations and store the coefs and P value in a new dataframe? Preferably by using an improved version of my failed attempt.

Thanks in advance

dput of sample of 50 random rows:

df <- structure(list(year = c(2015, 2012, 2017, 2014, 2018, 2018, 2012, 
    2018, 2013, 2013, 2012, 2018, 2012, 2018, 2016, 2013, 2018, 2019, 
    2012, 2019, 2017, 2014, 2013, 2014, 2018, 2016, 2013, 2019, 2019, 
    2018, 2019, 2014, 2012, 2018, 2017, 2016, 2017, 2015, 2017, 2019, 
    2012, 2016, 2019, 2019, 2018, 2014, 2012, 2015, 2012, 2012), 
        species = c("Aal", "Brasem", "Kolblei", "Dunlipharder", "Snoekbaars", 
        "Snoekbaars", "Paling", "Baars", "Tong", "Sprot", "Paling", 
        "Kolblei", "Baars", "Sprot", "Tong", "Baars", "Baars", "Zwartbekgrondel", 
        "Dikkopje", "Snoekbaars", "Blankvoorn", "Kolblei", "Kolblei", 
        "Baars", "Aal", "Kolblei", "Bot", "Snoekbaars", "Baars", 
        "Blankvoorn", "Zeebaars", "Snoekbaars", "Zeebaars", "Bot", 
        "Snoekbaars", "Bot", "Baars", "Baars", "Aal", "Snoekbaars", 
        "Baars", "Baars", "Bot", "Bot", "Bot", "Kleine koornaarvis", 
        "Snoekbaars", "Bot", "Blankvoorn", "Kleine koornaarvis"), 
        site = c("Amerikahaven (kop Aziëhaven)", "Het IJ (thv EyE)", 
        "Westhaven en ADM-haven (kop)", "Jan van Riebeeckhaven (thv Nuon)", 
        "Coenhaven", "Mercuriushaven", "Amerikahaven (kop Aziëhaven)", 
        "Mercuriushaven", "Amerikahaven kop Australiëhaven", "Het IJ (thv het Keerkringpark)", 
        "Amerikahaven kop Australiëhaven", "Coenhaven", "Jan van Riebeeckhaven (thv Nuon)", 
        "Het IJ (thv EyE)", "Westhaven en ADM-haven (kop)", "Het IJ (thv het Keerkringpark)", 
        "Jan van Riebeeckhaven (kop NZK)", "Het IJ (thv EyE)", "Westhaven en ADM-haven (kop)", 
        "Het IJ (thv EyE)", "Het IJ (thv EyE)", "Het IJ (thv het Keerkringpark)", 
        "Westhaven en ADM-haven (kop)", "Het IJ (kop Noordhollandsch kanaal)", 
        "Amerikahaven kop Australiëhaven", "Jan van Riebeeckhaven (thv Nuon)", 
        "Amerikahaven (kop Aziëhaven)", "Jan van Riebeeckhaven (kop NZK)", 
        "Westhaven en ADM-haven (kop)", "Jan van Riebeeckhaven (thv Nuon)", 
        "Amerikahaven kop Australiëhaven", "Het IJ (thv EyE)", "Amerikahaven (kop Aziëhaven)", 
        "Mercuriushaven", "Westhaven en ADM-haven (kop)", "Amerikahaven kop Australiëhaven", 
        "Minervahaven", "Westhaven en ADM-haven (kop)", "Westhaven en ADM-haven (kop)", 
        "Amerikahaven kop Australiëhaven", "Amerikahaven (kop Aziëhaven)", 
        "Amerikahaven kop Australiëhaven", "Jan van Riebeeckhaven (thv Nuon)", 
        "Westhaven en ADM-haven (kop)", "Petroleumhaven", "Westhaven en ADM-haven (kop)", 
        "Het IJ (thv EyE)", "Het IJ (kop Noordhollandsch kanaal)", 
        "Het IJ (thv EyE)", "Westhaven en ADM-haven (kop)"), abund = c(5, 
        25, 2, 15, 3, 4, 1, 176, 4, 1, 1, 4, 55, 1, 1, 37, 75, 11, 
        1, 121, 4, 2, 2, 412, 38, 1, 5, 2, 443, 2, 6, 12, 1, 10, 
        33, 14, 120, 377, 67, 29, 43, 524, 4, 31, 18, 5, 18, 1, 9, 
        31), n = c(6L, 4L, 4L, 7L, 3L, 3L, 3L, 3L, 7L, 4L, 5L, 3L, 
        8L, 4L, 8L, 7L, 8L, 6L, 4L, 7L, 7L, 4L, 4L, 7L, 8L, 6L, 5L, 
        8L, 8L, 5L, 6L, 7L, 3L, 3L, 8L, 8L, 3L, 8L, 8L, 8L, 6L, 8L, 
        7L, 8L, 3L, 4L, 7L, 3L, 7L, 4L)), row.names = c(NA, -50L), class = c("tbl_df", 
    "tbl", "data.frame"), na.action = structure(c(`47` = 47L, `52` = 52L, 
    `60` = 60L, `88` = 88L, `128` = 128L, `401` = 401L, `488` = 488L, 
    `593` = 593L, `633` = 633L), class = "omit"))
Stevestingray
  • 399
  • 2
  • 12
  • to run a regression, you need multiple observations. You're cycling by 1 row - `lm()` won't work with a single observation – Vasily A Nov 13 '20 at 15:31
  • Sorry, i had to include it in the question. every combination has multiple observations, since there are 8 different time periods. – Stevestingray Nov 13 '20 at 15:34

2 Answers2

1
all_sites <- unique(df$site)
all_species <- unique(df$species)

slopes <- expand.grid(all_sites, all_species)
names(slopes) <- c('site','species')
slopes$coef <- NA_real_
  
for (site in all_sites) {
  for (species in all_species){
    this_subset <- df[(df$site==site & df$species==species),]
    if (nrow(this_subset)<2) next;
    y <- this_subset$abund
    x <- this_subset$year
    cat('\n',site,species,nrow(this_subset),sum(is.na(x)),sum(is.na(y)))
    slopes[slopes$site==site & slopes$species==species, ]$coef <- coef(lm(y ~ x))[2]
  }
}
Vasily A
  • 8,256
  • 10
  • 42
  • 76
  • thanks for your response. Unfortunately this error pops up: `Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases` I did not know a loop in a loop was a thing, so thanks for that! I believe that the 'this_subset' object produces NA's, which causes the loop to fail. I tried fixing it, but could not find a way yet. The OG dataset has no NA's in it (checked with `sum(is.na(df))`. – Stevestingray Nov 13 '20 at 16:19
  • add a line for debugging information (see update in my answer) and tell us what it says when the error ocurs – Vasily A Nov 13 '20 at 16:30
  • it might happen when there's no such combination of `site` and `species`, or it has only a single observation. I have added the control for that (line `if (nrow(this_subset)<2) next;`) - let us know if it works. – Vasily A Nov 13 '20 at 16:47
  • Thanks so much, indeed some combinations did not exist which caused the error. The if function did the trick. Learnt a bunch today, thanks! – Stevestingray Nov 13 '20 at 17:52
1

Here's some simpler code:

for(site in df$site){
  for(species in df$species){
    assign(paste0('model_', site, species), lm(abund~year, data=df[df$site==site & df$species==species,]))
  }
}

As mentioned in Vasily A's answer, you need to ensure you have more than one observation for each regression model being built.

Gaurav Bansal
  • 5,221
  • 14
  • 45
  • 91