At the moment I'm learning Rust. Coming from a biology background, I thought a good exercise would be to write a small bioinformatics tool. I came up with the idea of implementing a struct/enum for sequences that provides some of the most common methods for such sequences, like transcription, translation, calculating GC content, etc. However, there are quite a few design choices to be made beforehand and I'm a bit lost at this point. My problem is the following:
There are essentially three types of sequences to be considered: DNA, RNA and proteins. All of these are sequences and they share some behavior (for instance, it should be possible to count the occurrences of symbols in all of them) but they also differ (DNA can be transcribed, a protein cannot). Inspired by the IpAddr enum from the standard library that is mentioned in the Rust book, I started by creating an enum that reflects the three possible sequence types.
pub enum Sequence {
    DNA(DNASeq),
    RNA(RNASeq),
    Protein(AASeq),
}
pub struct DNASeq {
    seq: String,
}
pub struct RNASeq {
    seq: String,
}
pub struct AASeq { // AA = Amino acid => protein.
    seq: String,
}
As I mentioned, some methods make sense for some sequences but not for others. For instance, DNA can be transcribed to RNA but a protein cannot be transcribed. So I decided to implement a transcribe method for the enum and a corresponding method for DNASeq:
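impl Sequence {
    // An enum variant is not a type, so the method returns a Result<Sequence, _>.
    pub fn transcribe(&self) -> Result<Sequence, &'static str> {
        match self {
            Sequence::DNA(dna) => Ok(Sequence::RNA(dna.transcribe())),
            // RNA is already transcribed; return a copy of it.
            Sequence::RNA(rna) => Ok(Sequence::RNA(RNASeq { seq: rna.seq.clone() })),
            Sequence::Protein(_) => Err("Amino acid sequences cannot be transcribed"),
        }
    }
}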
impl DNASeq {
    pub fn transcribe(&self) -> RNASeq {
        let seq = self.seq.chars()
            .map(|x| match x {
                'T' => 'U',
                _ => x,
            })
            .collect();
        RNASeq { seq }
    }
}
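To double-check that the enum method and the struct method fit together, here is a self-contained sketch of the pieces so far. The derives and the `Result` return type are my own assumptions to make it compile on its own; they are not essential to the design.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct DNASeq { seq: String }

#[derive(Debug, Clone, PartialEq)]
pub struct RNASeq { seq: String }

#[derive(Debug, Clone, PartialEq)]
pub struct AASeq { seq: String }

#[derive(Debug, Clone, PartialEq)]
pub enum Sequence {
    DNA(DNASeq),
    RNA(RNASeq),
    Protein(AASeq),
}

impl DNASeq {
    /// Replace every 'T' with 'U'; all other symbols are kept unchanged.
    pub fn transcribe(&self) -> RNASeq {
        let seq = self.seq
            .chars()
            .map(|c| if c == 'T' { 'U' } else { c })
            .collect();
        RNASeq { seq }
    }
}

impl Sequence {
    /// Assumed signature: failure is reported through `Result`
    /// because a protein has no transcript.
    pub fn transcribe(&self) -> Result<Sequence, &'static str> {
        match self {
            Sequence::DNA(dna) => Ok(Sequence::RNA(dna.transcribe())),
            // RNA is already transcribed, so hand back a copy.
            Sequence::RNA(rna) => Ok(Sequence::RNA(rna.clone())),
            Sequence::Protein(_) => Err("Amino acid sequences cannot be transcribed"),
        }
    }
}
```

Within the same module, `Sequence::DNA(DNASeq { seq: "GATTACA".to_string() }).transcribe()` gives `Ok` with an RNA sequence, while the protein variant gives the `Err` message, so the caller has to handle the case that does not make biological sense.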
To me, that makes sense so far. However, what about methods that are common to all three sequence variants? For instance, what about a method that counts how often each symbol appears in a sequence? The code is essentially the same in all cases. A naive solution would be to implement such a method for each struct:
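use std::collections::HashMap;

impl Sequence {
    // ...
    pub fn count(&self) -> HashMap<char, usize> {
        // Every arm just forwards to the wrapped struct.
        match self {
            Sequence::DNA(seq) => seq.count(),
            Sequence::RNA(seq) => seq.count(),
            Sequence::Protein(seq) => seq.count(),
        }
    }
}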
impl DNASeq {
    // ...
    pub fn count(&self) -> HashMap<char, usize> {
        let mut cnt = HashMap::new();
        for c in self.seq.chars() {
            *cnt.entry(c).or_insert(0) += 1;
        }
        cnt
    }
}
impl RNASeq {
    // As above.
}
impl AASeq {
    // As above.
}
This leads to a lot of code duplication. Of course, I could write a function that does the job and call it from each method, but that would separate the count implementation from the structs. Another idea was to create a trait with a default implementation, but a default implementation cannot access the structs' fields (which makes sense to me), so this doesn't work either. What would be a better way to design this? Or is my entire approach questionable? Do you have any further criticism?
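One variation of the trait idea that I have not ruled out: the default implementation could go through a required accessor method instead of touching the field directly. Below is a minimal sketch of what I mean. The trait name `SymbolSeq` and the accessor `as_str` are hypothetical names I made up for this example, and I'm not sure this is the idiomatic way to do it.

```rust
use std::collections::HashMap;

/// Hypothetical trait: implementors only say how to view themselves as text.
pub trait SymbolSeq {
    /// Required accessor (assumed name); the only method each type must write.
    fn as_str(&self) -> &str;

    /// Default implementation, written once and shared by all implementors.
    fn count(&self) -> HashMap<char, usize> {
        let mut cnt = HashMap::new();
        for c in self.as_str().chars() {
            *cnt.entry(c).or_insert(0) += 1;
        }
        cnt
    }
}

pub struct DNASeq { seq: String }
pub struct RNASeq { seq: String }
pub struct AASeq { seq: String }

impl SymbolSeq for DNASeq { fn as_str(&self) -> &str { &self.seq } }
impl SymbolSeq for RNASeq { fn as_str(&self) -> &str { &self.seq } }
impl SymbolSeq for AASeq { fn as_str(&self) -> &str { &self.seq } }

pub enum Sequence {
    DNA(DNASeq),
    RNA(RNASeq),
    Protein(AASeq),
}

/// The enum forwards the accessor and inherits `count` for free.
impl SymbolSeq for Sequence {
    fn as_str(&self) -> &str {
        match self {
            Sequence::DNA(s) => s.as_str(),
            Sequence::RNA(s) => s.as_str(),
            Sequence::Protein(s) => s.as_str(),
        }
    }
}
```

With this, `count` exists in exactly one place, and each struct contributes a one-line `as_str`. The price is that the raw text becomes reachable through the trait, which may or may not be acceptable for the design.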
Thanks a lot in advance!