I am trying to represent the sequence of a biological virus as ATGC
s, but I have seen code where it is represented as 1234
s instead. Are there any differences in memory usage or code speed if we store it as the integers [1,2,3,4]
instead of the letters [A,T,G,C]
?
For those who might need a bit more context, I will not be doing any mathematical operations on the string of numbers/letters apart from changing their identities at random positions (i.e. mutation), keeping track of the positions that are mutated away from a reference sequence in a dictionary (such as: {2:'G', 52:'A'}
or {2:3, 52:1}
), and exporting the full sequence of any biological virus strain by iterating over the reference sequence and checking the mutation dictionary for any mutations.