1

In my postgres database I have a column which contains sequences of characters. The characters in these sequences are amino acids. There are only 20 amino acids plus some extra characters needed for special purposes.

Currently these are stored with type 'character varying'. I assume that this is inefficient because one byte is used per a character whereas in theory my alphabet could be represented by 5 bits (2 ** 5 = 32). By inefficient I mean it takes more memory than is necessary and that if there are less bits to check comparison methods (such as checking whether one string is equal another or contains another) would require more operations.

Is this correct? Is there some more efficient way I could store this data to minimise the size of the database and to make string operations more efficient?

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
public static void
  • 1,153
  • 11
  • 20
  • 1
    ...how big is your dataset such that you think UTF-8 is "inefficient"? – Dai Apr 21 '23 at 11:18
  • 1
    Generally speaking it's a bad idea to introduce your own text-to-binary encoding when saving to a database (presuming you're using `varbinary` instead of using an actual database type definition) - as you'll lose the ability to use all of the text-optimized features in any SQL database. – Dai Apr 21 '23 at 11:20
  • Do you have a problem? If so, could you share it with us? – Frank Heikens Apr 21 '23 at 11:28
  • 2
    The game isn't worth the candle (more harm than good). – klin Apr 21 '23 at 12:04

1 Answers1

1

Don't do that. The savings on storage are marginal, while the cons are substantial:

  • You incur higher development and maintenance cost of encoding/decoding to IUPAC code of amino acids.
  • You lose the ability to search for sequences using powerful regular expressions: for example, SELECT * FROM proteins WHERE sequence ~ '^Y.{2,3}[RK]L'
Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
  • Makes sense thanks. Is there a fundamental reason regular expressions could not be reimplemented for a library relying on less than 1 byte characters? – public static void Apr 25 '23 at 14:19
  • None that I know. Feel free to ask this as a separate question: this way you may get am answer from regex gurus on this site. – Timur Shtatland Apr 25 '23 at 14:33