I successfully completed the DADA2 Pipeline Tutorial (https://benjjneb.github.io/dada2/tutorial.html) using my own data but have gotten stuck with the transition to Phyloseq. I need to construct a simple data.frame from the information encoded in the filenames. This is the code provided in the tutorial.

#Make a data.frame holding the sample data
samples.out <- rownames(seqtab.nochim)
subject <- sapply(strsplit(samples.out, "D"), `[`, 1)
gender <- substr(subject,1,1)
subject <- substr(subject,2,999)
day <- as.integer(sapply(strsplit(samples.out, "D"), `[`, 2))
samdf <- data.frame(Subject=subject, Gender=gender, Day=day)
samdf$When <- "Early"
samdf$When[samdf$Day>100] <- "Late"
rownames(samdf) <- samples.out

Mine should be simpler than this because I don't have time as a factor; I just have six treatment groups.

This is me trying to figure it out.

#Make a data.frame holding the sample data
samples.out <- rownames(seqtab.nochim)

#create vector with the treatments
trtmt <- c("EM", "EP", "EM", "AR37", "NEA2", "AR1", "AR37", "NEA2", "EP", "NEA2", "EP", "EM", "AR37", "EP", "NEA2", "Ctrl", "Ctrl", "AR37", "EP", "AR37", "AR37", "EP", "AR1", "AR1", "EP", "EM", "EM", "AR37", "AR1", "EM", "AR37", "NEA2", "AR1", "Ctrl", "EP", "Ctrl", "EP", "AR37", "AR37")

#Add a new column to the samples.out dataframe 
samples.out_2 <- samples.out
samples.out_2 <- cbind(samples.out, new_col = trtmt)

#Rename columns
colnames(samples.out_2)[colnames(samples.out_2) == "samples.out"] <- "Sample"
colnames(samples.out_2)[colnames(samples.out_2) == "new_col"] <- "Treatment"

#Head of my samples.out_2 data frame (I have a total of 39 samples and 6 treatment groups)
Sample Treatment
193    EM
194    EP
196    EM
197    AR37
198    NEA2

#Still stuck with how to make this relevant to my metadata!
sample <- sapply(strsplit(samples.out_2, "D"), `[`, 1) #what does the "D" mean (I think it has to do with the mouse dataset used in the tutorial)? However, I am not sure what I need to pull from my data.frame. Also, What does '[' mean? I know the meanings for operators like [], (), etc., but not for a single one in quotes.
treatment <- substr(sample,1,39) #I don't understand what I am trying to extract or change
sample <- substr(sample,2,999) #I don't understand what I am trying to extract or change
samdf <- data.frame(Sample=sample, Treatment=treatment)
rownames(samdf) <- samples.out

If anyone has gone through this tutorial with their own data, and understood this transition, I would appreciate your insights. Thanks

1 Answer

You need to create a data frame holding your metadata in an object called samdf, as in the tutorial. In the tutorial, the metadata are encoded in the sample filenames, which doesn't seem to be the case with your data.

For example, the first sample name, F3D0, breaks down as:

F3D0 : Gender (F) - Subject (3) - Day (D0)

The lines of code to define Subject, Gender and Day in the tutorial are not relevant for your data.

subject <- sapply(strsplit(samples.out, "D"), `[`, 1) # define subject as beginning of the filename string up to D
gender <- substr(subject,1,1) #gets first letter for the gender
subject <- substr(subject,2,999) #remove gender to actually get the subject number
day <- as.integer(sapply(strsplit(samples.out, "D"), `[`, 2)) #define day
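
To make the parsing concrete, here is a small sketch that runs those four lines on two sample names. F3D0 is the name discussed above; F3D141 is included only as an illustrative second name. The "D" is simply the character that strsplit() splits on, and the backquoted `[` is R's subsetting function passed by name, so sapply(x, `[`, 1) takes the first element of each piece.

```r
# F3D0 is the example above; F3D141 is an illustrative second name
samples.out <- c("F3D0", "F3D141")

# Split each name at the letter "D": gives a list of c("F3", "0") and c("F3", "141")
parts <- strsplit(samples.out, "D")

# `[` is the subsetting function; this is the same as
# sapply(parts, function(x) x[1])
subject <- sapply(parts, `[`, 1)

gender  <- substr(subject, 1, 1)     # first character: the gender letter
subject <- substr(subject, 2, 999)   # everything after it: the subject number
day     <- as.integer(sapply(parts, `[`, 2))  # second piece: the day

print(data.frame(Subject = subject, Gender = gender, Day = day))
```

Since your sample names (193, 194, ...) contain no such encoding, there is nothing to split, which is why these lines can be dropped.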

The last two lines of the tutorial code are the ones that matter: the first creates the data frame holding your metadata, and the second gives it the same row names as seqtab.nochim, so that you can build your phyloseq object further down the pipeline. Make sure samdf and seqtab.nochim have the same number of rows:

isTRUE(dim(seqtab.nochim)[1] == dim(samdf)[1]) #should be true
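
For your data, a minimal sketch could look like the following. The small seqtab.nochim matrix here is a hypothetical stand-in so the example runs on its own (in practice it comes from the DADA2 steps), and only the first five samples and treatments from your question are used; you would supply all 39 treatments in the same order as rownames(seqtab.nochim).

```r
# Hypothetical stand-in for the real seqtab.nochim (samples in rows, ASVs in columns)
seqtab.nochim <- matrix(
  0L, nrow = 5, ncol = 2,
  dimnames = list(c("193", "194", "196", "197", "198"), c("ASV1", "ASV2"))
)

samples.out <- rownames(seqtab.nochim)

# Treatments must be listed in the same order as the rows of seqtab.nochim
trtmt <- c("EM", "EP", "EM", "AR37", "NEA2")
stopifnot(length(trtmt) == length(samples.out))

# Build the metadata data frame directly; no filename parsing is needed
samdf <- data.frame(Sample = samples.out, Treatment = trtmt,
                    stringsAsFactors = FALSE)
rownames(samdf) <- samples.out

# Row names should match exactly, not just in number
print(identical(rownames(samdf), rownames(seqtab.nochim)))
print(samdf)
```

Comparing the row names with identical() is a slightly stricter check than comparing row counts, since it would also catch samples listed in a different order. The resulting samdf can then be passed to sample_data() when building the phyloseq object, as the tutorial does.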