
I am trying to represent the sequence of a biological virus as ATGCs, but I have seen code where it is represented as 1234s instead. Are there any differences in memory usage or code speed if we store it as the integers [1,2,3,4] instead of the letters [A,T,G,C]?

For context, I will not be doing any mathematical operations on the sequence. The only operations are: changing the identity of the base at random positions (i.e. mutation); keeping track of the positions that have mutated away from a reference sequence in a dictionary (such as {2: 'G', 52: 'A'} or {2: 3, 52: 1}); and exporting the full sequence of any virus strain by iterating over the reference sequence and checking the mutation dictionary at each position.
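The workflow described above can be sketched roughly as follows. This is only an illustration, assuming Python and the letter representation; the names `BASES`, `mutate`, and `export_sequence`, and the example reference sequence, are hypothetical and not taken from any existing code.

```python
import random

BASES = "ATGC"  # assumed alphabet; integer codes 1-4 would work the same way


def mutate(reference, mutations, position, rng=random):
    """Change the base at `position` to a different one and record it."""
    current = mutations.get(position, reference[position])
    new_base = rng.choice([b for b in BASES if b != current])
    if new_base == reference[position]:
        # Mutated back to the reference base: no longer a difference.
        mutations.pop(position, None)
    else:
        mutations[position] = new_base
    return new_base


def export_sequence(reference, mutations):
    """Rebuild the full strain sequence from the reference plus its mutations."""
    return "".join(mutations.get(i, base) for i, base in enumerate(reference))


reference = "ATGCATGC"        # hypothetical reference sequence
mutations = {2: "C", 5: "A"}  # position -> base that differs from the reference
print(export_sequence(reference, mutations))
```

Only the positions that differ from the reference are stored, so the dictionary stays small as long as mutations are sparse.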

ericmjl

1 Answer


Whether to use strings or integers depends on the size of your DNA sequence. Some sequences run to millions of elements. If you are dealing with that much data, it is better to use typed (fixed-width) integers. Otherwise, strings are fine if they are more convenient for you.
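One way to check this on your own machine is to measure the containers directly. The snippet below is a rough sketch, assuming CPython and using the standard-library `array` module as one example of typed integers; the sequence length `n` is an arbitrary illustrative value, and exact byte counts vary by Python version and platform.

```python
import sys
from array import array

n = 1_000_000  # hypothetical sequence length

as_str = "ATGC" * (n // 4)         # ASCII text: compact storage in CPython
as_list = [1, 2, 3, 4] * (n // 4)  # plain list: one pointer per element
as_typed = array("B", as_list)     # typed unsigned 8-bit integers: one byte per element

# sys.getsizeof reports the container itself, not objects it points to.
for name, obj in [("str", as_str), ("list of int", as_list), ("array('B')", as_typed)]:
    print(f"{name:12s} {sys.getsizeof(obj):>10,d} bytes")
```

On a typical build the plain list of integers is the largest by a wide margin, while the string and the typed array are of similar size, so the gain comes from using a typed container rather than from integers as such.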

Taha