0

I would like to ask why my rarefaction curve both for the rarefied data and non rarefied doesnt seem ideal. What could be a cause to this?

This was the script i used:

sam.data_soil_rare <- data.frame(bac.unedited@sam_data)
sam.data_soil_rare 
OTU.table_soil_rare <- otu_table(bac.unedited) %>%
  as.data.frame() %>%
  as.matrix()
OTU.table_soil_rare

raremax_soil <- min(rowSums(t(OTU.table_soil_rare)))
kbd>rare.fun_soil <- rarecurve(t(OTU.table_soil_rare), step = 200, sample = raremax, tidy = T)
rare.fun_soil
bac.rare.curve.extract2_soil <- left_join(sam.data_soil_rare, rare.fun_soil, by = c("Samples" = "Site"))
bac.rare.curve.extract2_soil

bac.rare_soil_plot <- ggplot(bac.rare.curve.extract2_soil, aes(x = Sample, y = Species, group = Samples,
color = Sample_or_Control)) + 
  #geom_point() +
  geom_line(size = 1) + 
  xlab("Reads") +
  ylab("Number of ASVs") +
  ggtitle("Soil") +
  theme_classic() + 
  theme(legend.position="none") +
  geom_vline(xintercept = median(sample_sums(soil_bac_sub_no_bad_Filtered)), linetype = "solid") +
  scale_color_manual(values = cbbPalette)
bac.rare_soil_plot

I tried using a different package to check the rarefaction curve on unfiltered phyloseq but got the same typical non-ideal graph.

John Kugelman
  • 349,597
  • 67
  • 533
  • 578

1 Answers1

0

There is no reproducible example, and you do not explain how the curves are "non-ideal". They look pretty normal to me.

My guess is that you have multiplied your data with some arbitrary numbers and this makes them unsuitable for rarefaction. Perhaps you have even pre-processed data and removed some of the rarest cases as "read errors" which makes data even less suitable for rarefaction.

Rarefaction should be applied for observed data. In general, rarefaction works by subsampling your observed data, and in this process some species drop off first and this reduces the number of species. If you have taxa that only occur once or twice, they are often some of the first to drop off, and this gives smoothly changing rarefaction curve. If you have multiplied your data (like is common in your field), the taxon that you have only observed once will be multiplied, say, to abundance 1000 (or what ever is your multiplier). Dropping off taxa with such huge numbers happens much more slowly, and for long time you have no reduction in the number of species and you get that horizontal line. Only when you have subsampled to very low fractions, you undo your multiplication and then all species start dropping off very rapidly and you get that very steep part of the curve.

You can see the effect of multiplication by comparing these two models using vegan data:

library(vegan)
data(BCI)
rarecurve(BCI) # observed counts: correct
rarecurve(10 * BCI) # multiplied: wrong

Current version of vegan issues a warning if you do not have any counts of 1. This may be a false positive, but it should alert you to check that you are using observed data instead of multiplied data or data with rare cases removed.

Jari Oksanen
  • 3,287
  • 1
  • 11
  • 15
  • Thanks so much for your swift reply. I have actually done some filtering by removing chloroplast and mitochondria. In addition to this, I also removed singletons. I guess perhaps I need to perform the rarefaction curve on the original file without any processing. I will try that out. My definition of an ideal rarefaction curve is one where the sample is more diverse with an increasing number of sequencing. I will definitely try out what you suggested. Sorry, I am relatively new to learning microbiome and this would be my first project. Thanks – Aduragbemi Jun 24 '23 at 17:43
  • No worries @Aduragbemi: rarefaction is a delicate business, and not so easy to do right. Naturally, you should remove clearly (a priori) wrong reads (mitochondria, chloroplasts), but singletons are a different issue. Rarefaction kind of expects singletons, and for that purpose they should be retained, unless you have reason to think they are wrong (and which of them are wrong). For *other* analyses it often makes sense to remove these, but rarefaction is *only* a method of assessing equalized taxon (or OTU) richness. – Jari Oksanen Jun 24 '23 at 21:05
  • I used the original file without filtering or normalization and still got the same type of graph. Do i need to restart the whole process again? – Aduragbemi Jun 25 '23 at 02:13
  • This means that you do not have rare OTUs and are OTUs are abundant and perhaps even about equally abundant. If these are you date, this is the kind of rarefaction curve you get. However, we do not have your data, nor reproducible example, and this is only guessing. – Jari Oksanen Jun 25 '23 at 07:08
  • I wouldnt mind sharing the otu table and metadata. I really have explored the options and I am looking to see if i need to restart the whole stuff again from the fastqc to adapter removal but even if i restart i wouldnt do anything different. Moreso i used DADA2 so the clustering was in ASVs and not OTUs – Aduragbemi Jun 25 '23 at 17:48
  • I'm not in your field, and words like fastqc, adapter removal, dada2, asv are gibberish to me. However, you need simple unmodified counts: how many times you really have observed (counted) an OTU in this sampling unit. The minimum should be one read. If the lowest possible minimum is greater than one, there is no sense to rarefy. – Jari Oksanen Jun 25 '23 at 21:29
  • Still one comment: your rarefaction curves are probably correct for **your data**. Do you really have data with some 20000 **observed and counted** individuals per sampling unit? I have an itch that somewhere in your process you have multiplied your data so that they would be expressed per mass or volume or some other unit instead of being expressed as observed counts per sampling unit. – Jari Oksanen Jun 25 '23 at 21:45