How to design DNA/RNA/protein sequences in Rust

Question

I'm currently learning Rust. Coming from a biology background, I thought a good exercise would be to write a small bioinformatics tool. My idea is to implement a struct/enum for sequences that provides the most common methods for such sequences, like transcription, translation, calculating GC content, etc. However, there are quite a few design choices to be made beforehand and I'm a bit lost at this point. My problem is the following:

There are essentially three types of sequences to consider: DNA, RNA and proteins. All of these are sequences and share some behavior (for instance, it should be possible to count the occurrences of symbols in all of them), but they also differ (DNA can be transcribed, a protein cannot). Inspired by the IpAddr enum from the standard library that is mentioned in the Rust book, I started by creating an enum that reflects the three possible sequence types.

pub enum Sequence {
    DNA(DNASeq),
    RNA(RNASeq),
    Protein(AASeq)
}


pub struct DNASeq {
    seq: String
}


pub struct RNASeq {
    seq: String
}


pub struct AASeq {  // AA = Amino acid => protein.
    seq: String
}

As I mentioned, some methods make sense for some sequences but not for others. For instance, DNA can be transcribed to RNA but a protein cannot be transcribed. So I decided to implement a transcribe method for the enum and a corresponding method for DNASeq:

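impl Sequence {
    // Returns a Result because amino acid sequences cannot be transcribed.
    pub fn transcribe(&self) -> Result<Sequence, &'static str> {
        match self {
            Sequence::DNA(seq) => Ok(Sequence::RNA(seq.transcribe())),
            // Already RNA: return a copy of the underlying string.
            Sequence::RNA(seq) => Ok(Sequence::RNA(RNASeq { seq: seq.seq.clone() })),
            Sequence::Protein(_) => Err("Amino acid sequences cannot be transcribed"),
        }
    }
}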


impl DNASeq {
    pub fn transcribe(&self) -> RNASeq {
        let seq = self.seq.chars()
            .map(|x| match x {
                'T' => 'U',
                _ => x
            })
            .collect();

        RNASeq{ seq }
    }
}

To me, that makes sense so far. However, what about methods that are common to all three sequence variants? For instance, what about a method that counts the occurrences of each symbol in a sequence? The code is essentially the same in all cases. A naive solution would be to implement such a method for each struct:

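use std::collections::HashMap;

impl Sequence {
    // ...

    pub fn count(&self) -> HashMap<char, usize> {
        // Delegate to the wrapped struct, whatever the variant.
        match self {
            Sequence::DNA(seq) => seq.count(),
            Sequence::RNA(seq) => seq.count(),
            Sequence::Protein(seq) => seq.count(),
        }
    }
}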


impl DNASeq {
    // ...

    pub fn count(&self) -> HashMap<char, usize> {
        let mut cnt = HashMap::new();

        for c in self.seq.chars() {
            *cnt.entry(c).or_insert(0) += 1;
        }

        cnt
    }
}


impl RNASeq {
    // As above.
}


impl AASeq {
    // As above.
}

This leads to a lot of code duplication. Of course, I could write a function that does the job and then call that function from each method, but that would separate the count implementation from the structs. Another idea was to create a trait and provide a default implementation, but I cannot access the struct's fields from a default implementation (which makes sense to me), so this doesn't work either. What would be a better way to design this? Or is my entire approach questionable? Do you have any further criticism?
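
For reference, the helper-function idea mentioned above could look roughly like the following. This is only a minimal sketch: the name `count_symbols` is hypothetical, and only two of the three structs are shown.

```rust
use std::collections::HashMap;

// Hypothetical shared helper (the name `count_symbols` is an assumption).
// It works on any string slice, so every sequence type can reuse it.
fn count_symbols(seq: &str) -> HashMap<char, usize> {
    let mut cnt = HashMap::new();
    for c in seq.chars() {
        *cnt.entry(c).or_insert(0) += 1;
    }
    cnt
}

pub struct DNASeq {
    seq: String,
}

pub struct RNASeq {
    seq: String,
}

impl DNASeq {
    // Each struct keeps a thin `count` method that forwards to the helper.
    pub fn count(&self) -> HashMap<char, usize> {
        count_symbols(&self.seq)
    }
}

impl RNASeq {
    pub fn count(&self) -> HashMap<char, usize> {
        count_symbols(&self.seq)
    }
}
```

The counting logic lives in one place, but each struct still needs its own one-line forwarding method, which is the separation described above.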

Thank you a lot in advance!

‘I cannot access the structs fields from a default implementation’: you could add a `fn as_seq(&self) -> &str` function to the trait and implement it for each sequence type to access `seq`. I'm not that experienced with Rust, but I would use traits instead of an enum. It doesn't make sense to call `transcribe` on an amino acid sequence, so maybe have a `Transcribable` trait that the DNA and RNA sequences implement. All the sequences could also implement a `Sequence` trait that has the `count` function. — Lauren Yim, Sep 02 '21 at 12:44
You might also have better luck asking this on [Code Review](https://codereview.stackexchange.com/) if you have a working implementation and are looking for feedback on it. — Lauren Yim, Sep 02 '21 at 12:46
Hello. For the duplicate code, a macro is the way to go. The answer to the following question (https://stackoverflow.com/questions/45664392/is-it-possible-to-create-a-macro-that-implements-ord-by-delegating-to-a-struct- m) should put you on the right track. — Zeppi, Sep 02 '21 at 12:49
The `.transcribe()` method is a good candidate for Rust's `Into` trait. It allows you to write `let rna: RNASeq = dna.into();` as long as `DNASeq` implements `Into<RNASeq>` or `RNASeq` implements `From<DNASeq>`. — loa_in_, Sep 02 '21 at 17:45
Thank you all very much for the input. I started by creating traits to express some of the functionality that is common to all or some of the sequence types. I will also think about the `Into` trait. I like it language-wise, but I'm not yet certain whether it hides the fact that what is going on is transcription. — R. Schleutker, Sep 02 '21 at 18:12
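
The suggestions from the comments can be combined into one sketch. It is not a definitive design: it assumes the enum is dropped in favor of a `Sequence` trait with an `as_seq` accessor, a separate `Transcribable` trait, and a `From<&DNASeq>` conversion that simply delegates to `transcribe`. The trait and method names follow the comments; treating RNA as trivially transcribable mirrors the enum version in the question.

```rust
use std::collections::HashMap;

pub struct DNASeq { seq: String }
pub struct RNASeq { seq: String }
pub struct AASeq { seq: String } // AA = amino acid => protein.

/// Behavior shared by all sequence types (assumed to replace the enum).
pub trait Sequence {
    /// Accessor that lets default methods read the underlying string.
    fn as_seq(&self) -> &str;

    /// Default implementation, written once for all implementors.
    fn count(&self) -> HashMap<char, usize> {
        let mut cnt = HashMap::new();
        for c in self.as_seq().chars() {
            *cnt.entry(c).or_insert(0) += 1;
        }
        cnt
    }
}

impl Sequence for DNASeq { fn as_seq(&self) -> &str { &self.seq } }
impl Sequence for RNASeq { fn as_seq(&self) -> &str { &self.seq } }
impl Sequence for AASeq { fn as_seq(&self) -> &str { &self.seq } }

/// Implemented only where transcription is meaningful, so calling
/// `transcribe` on an `AASeq` is a compile-time error.
pub trait Transcribable {
    fn transcribe(&self) -> RNASeq;
}

impl Transcribable for DNASeq {
    fn transcribe(&self) -> RNASeq {
        // Replace thymine with uracil, leave every other symbol unchanged.
        let seq = self.seq.chars().map(|c| if c == 'T' { 'U' } else { c }).collect();
        RNASeq { seq }
    }
}

impl Transcribable for RNASeq {
    // RNA is already transcribed, so return a copy.
    fn transcribe(&self) -> RNASeq {
        RNASeq { seq: self.seq.clone() }
    }
}

/// Optional conversion; it delegates to `transcribe` so the biology stays explicit.
impl From<&DNASeq> for RNASeq {
    fn from(dna: &DNASeq) -> Self {
        dna.transcribe()
    }
}
```

With this in place, `dna.transcribe()` and `RNASeq::from(&dna)` produce the same result, so the explicit method name can stay the primary entry point if `into()` feels like it hides the transcription step.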
