0

I am studying the microbiome of an organism and I have sequenced 3 PCR replicates for each sample. I am working with a phyloseq object and my sample_df looks like this (simplified):

> df
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 28974 taxa and 112 samples ]
sample_data() Sample Data:       [ 112 samples by 2 sample variables ]
tax_table()   Taxonomy Table:    [ 28961 taxa by 10 taxonomic ranks ]

>sample_data(df)
Sample Data:        [112 samples by 2 sample variables]:
                
            sample   replicate
s1_A        sample1       A
s1_B        sample1       B
s1_C        sample1       C
s2_A        sample2       A
s2_B        sample2       B
s2_C        sample2       C

I want to merge my ESVs reads, but only when they appear in two out of three replicates and I have no idea how to do it.

I know about the function merge_samples to merge all the replicates, but I would like to add the condition of 2/3 replicates

df_merged<- merge_samples(df, "sample", fun=sum) #merge samples

I have no idea how to continue, so any help would be great

thanks so much!

  • using package {dplyr} you should be able to filter out replicate counts of two per sample like this: `df_merged |> add_count(sample, replicate) |> filter(n == 2)` (adjust condition as required) – I_O Jul 15 '23 at 10:14
  • hi! thanks for the coment! That looks good but it cannot be applied to a phyloseq object... in that object, the sample variables are in one dataset (sample_data), and the ESVs counts are in another (otu_table). – Laura Drh Jul 15 '23 at 10:44
  • Maybe consider asking this on https://bioinformatics.stackexchange.com – Andre Wildberg Jul 15 '23 at 10:59

1 Answers1

0

In case someone needs it, I managed this way:

# Create an empty data frame with column names
      
      comm <- t(as.data.frame(otu_table(df)))
      merged_data <- data.frame(matrix(ncol = ncol(comm), nrow = 0))
      merged_data
      names(merged_data) <- colnames(comm)
      base_sample_names <- gsub("_.$", "", rownames(comm)) # Get the base names of the samples (without the '_A', '_B', and '_C' suffixes)      
      unique_base_samples <- unique(base_sample_names)  # Get the unique base sample names
      
      
      # For each unique base sample name:
      
      for (base_sample in unique_base_samples) {
        replicate_indices <- which(base_sample_names == base_sample)  # Get the indices of the rows corresponding to the three replicates
        replicates <- comm[replicate_indices, ]   # Extract the replicate rows
        
        
        
        # Compute the merged row
        
        merged_row <- apply(replicates, 2, function(column) { 
          
          # If the ESV is present in at least two replicates, return the sum of abundances. Can also be changed to 3 or just 1
          
          if (sum(column > 0) >= 2) {
            return(sum(column))
          } else {
            return(0)
          }
        })
        merged_row <- data.frame(t(merged_row))     #Make sure the result is a data frame with correct column names
        colnames(merged_row) <- colnames(comm)
        merged_data <- rbind(merged_data, merged_row)    # Add the merged row to the data frame
      }
      
      rownames(merged_data) <- unique_base_samples # Set the row names of the merged data
      print(merged_data) # Print the merged data
      merged_data<-t(merged_data)  #otu are rows
  • Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Jul 24 '23 at 04:12
  • Hi, I included an answer the 19/jul on how I managed – Laura Drh Jul 31 '23 at 13:28
  • Yes, that’s what this comment was in response to. – Jeremy Caney Jul 31 '23 at 18:01