Since genomic sequences vary greatly in length, I have been trying to use denoising autoencoders to get a compact representation of any given sequence. My expected input is a sequence of nucleotides (the letters A, G, T, C), for example "AAAAGGAATTTCTCTGGGG....".

For images, adding noise is easy since the input space is continuous. But in a discrete scenario such as this, what would be a good strategy for adding noise to my input?

My first thought is to randomly replace some of the nucleotides with "N", which means that the nucleotide at that position couldn't be identified accurately during sequencing. But changing even one nucleotide leads to a completely different sequence, unlike images, where adding a small amount of noise doesn't change how the image looks. Please let me know if this is right or if there's a better way that I am not aware of.
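
To make the two candidate noise models concrete (masking with "N" versus substituting another nucleotide), here is a minimal Python sketch. It is only an illustration, not an established method: the function name `corrupt_sequence` and the parameters `mask_rate` and `sub_rate` are made up for this example, and the rates are arbitrary.

```python
import random

NUCLEOTIDES = "ACGT"

def corrupt_sequence(seq, mask_rate=0.05, sub_rate=0.0, rng=None):
    """Return a noisy copy of seq (hypothetical helper).

    Each position is independently replaced by "N" with probability
    mask_rate, or by a different random nucleotide with probability
    sub_rate. The sequence length is preserved.
    """
    rng = rng or random.Random()
    noisy = []
    for base in seq:
        r = rng.random()
        if r < mask_rate:
            # Masking: pretend the base could not be called.
            noisy.append("N")
        elif r < mask_rate + sub_rate:
            # Substitution: pick one of the other nucleotides.
            choices = [b for b in NUCLEOTIDES if b != base]
            noisy.append(rng.choice(choices))
        else:
            noisy.append(base)
    return "".join(noisy)

clean = "AAAAGGAATTTCTCTGGGG"
noisy = corrupt_sequence(clean, mask_rate=0.1, rng=random.Random(0))
print(clean)
print(noisy)
```

In a denoising setup, `noisy` would be the network input and `clean` the reconstruction target.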

merv
Aman Dalmia
  • Your expected output is unclear. Are you trying to compress the sequences to smaller representations, but still be able to decompress? Or are you trying to clean noisy sequences and output cleaned sequences? Also, you can think of a sequence "ACGACTAGCTCTATTGC" as a one-dimensional image with a 2-bit colorspace. – merv Mar 13 '18 at 21:15
  • @merv I am trying to get a compact representation for a given sequence, so your first guess seems to be what I am looking for. It's right that it can be viewed as an image but, as I mentioned, a small amount of noise added to an image wouldn't make much of a difference visually, whereas in this case it changes the sequence entirely. Please correct me if I am wrong, but would just adding some kind of random noise and/or zeroing out a few positions give an actual noisy example from which we expect to recover the original sequence? – Aman Dalmia Mar 14 '18 at 09:41
  • Then I think you simply want an autoencoder, no noise. Otherwise, for "noisy" data, you need to clarify your goals, since currently it seems like an ill-defined problem. From a biological perspective a single-nucleotide change can have a drastic functional impact, but if you are talking about noise in high-throughput sequencing reads, those errors rarely propagate up to the consensus sequence. Again, I think if you clarified your output goal, this would make more sense. Searching PubMed/bioRxiv to get a sense of how others are applying denoising autoencoders could be useful. – merv Mar 14 '18 at 19:43
  • @merv Sure, thanks for your response. Most of the work using denoising autoencoders has been on gene expression data, in which case adding noise is straightforward (e.g. dropping a few positions). So, I'll first try the autoencoders without any noise, but if the results are not satisfactory, I might have to resort to something like adding noise to the Phred quality scores, as suggested in the answer below. – Aman Dalmia Mar 15 '18 at 00:32
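
The "one-dimensional image" view and the idea of "zeroing out a few positions" from the comments above can be sketched with a one-hot encoding. This is a rough illustration under assumptions: the helper name `one_hot`, the channel order "ACGT", and the choice to encode "N" as an all-zero row are conventions picked for the example, not something stated in the thread.

```python
# Assumed channel order; any fixed order works as long as it is consistent.
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element rows.

    Known bases become one-hot rows; anything else (e.g. "N") becomes
    an all-zero row, one possible convention for "unknown".
    """
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows

encoded = one_hot("ACGN")
for base, row in zip("ACGN", encoded):
    print(base, row)
```

Under this convention, replacing a base with "N" is the same as zeroing out that position of the encoded input.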

1 Answer

I'm not sure if this will help you or further complicate your issue, but in biology people normally use FASTQ files to store biological sequences and their corresponding Phred quality scores. A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.

For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
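
The arithmetic behind that example is the standard Phred relationship Q = -10 * log10(P), where P is the probability that the base call is wrong. A small Python sketch follows; the function names are chosen for illustration, and the ASCII offset of 33 assumes the file uses the Sanger / Illumina 1.8+ FASTQ encoding, which may not hold for older data.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)

def decode_fastq_quality(qual_string, offset=33):
    """Turn a FASTQ quality line into integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(c) - offset for c in qual_string]

print(phred_to_error_prob(30))        # error probability for Q30
print(decode_fastq_quality("I5!"))    # Phred scores for three characters
```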

[Figure: Phred quality scores shown on a DNA sequence trace. Public domain image from Wikipedia.]

So you can add noise to the Phred quality scores (which encode the probability that each base call is wrong) without changing the sequence.
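
Two ways this could look in practice are sketched below in Python, purely as an illustration. Both helpers (`jitter_quals`, `sample_noisy_read`) and the Gaussian noise model are assumptions made for this example and are not taken from any particular tool: the first perturbs the scores themselves, the second uses the scores to sample a noisy sequence.

```python
import random

def jitter_quals(quals, sd=2.0, rng=None):
    """Add Gaussian noise to Phred scores, rounded and floored at 0.

    The noise model and the default sd are arbitrary assumptions.
    """
    rng = rng or random.Random()
    return [max(0, round(q + rng.gauss(0, sd))) for q in quals]

def sample_noisy_read(seq, quals, rng=None):
    """Substitute each base with probability 10**(-Q/10).

    A low-quality base is therefore more likely to be replaced by a
    different nucleotide than a high-quality one.
    """
    rng = rng or random.Random()
    out = []
    for base, q in zip(seq, quals):
        p_err = 10 ** (-q / 10)
        if rng.random() < p_err:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

seq = "ACGTACGT"
quals = [40, 40, 5, 40, 3, 40, 40, 2]
rng = random.Random(0)
print(jitter_quals(quals, rng=rng))
print(sample_noisy_read(seq, quals, rng=rng))
```

With the first helper the sequence stays fixed and only the quality channel is noisy; with the second, the clean sequence can serve as the reconstruction target for the sampled one.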

Also see this paragraph about current work done on compressing FASTQ files.

BioGeek
  • While I agree that FASTQs should be used if the goal is to correct HTS reads, I don't understand what perturbing the quals would accomplish. Is one supposed to predict the "true" qual scores? It would make more sense to take actual HTS reads that came from a known source (e.g., ERCC spike-ins or phage gDNA) and train the net to output the true sequence. The original reads already have noise that could be useful to model. – merv Mar 14 '18 at 19:59
  • Thanks for the detailed answer. This seems extremely helpful. I understand the gist of what you are trying to convey. But as asked by @merv, do I predict the true qual scores, or just sample according to the noisy qual scores to make my noisy input? – Aman Dalmia Mar 15 '18 at 00:35