
I am running a function along a genetic string (such as CCAGTAATTA). At each letter my function computes an integer. So for the string above, there are 10 letters and I get a vector with 10 integers.

For chromosome 1, there are 195 million letters, so I get a very long vector. The sequence itself takes only 186.4 MB of memory, but the vector takes over 1.5 GB.

I have two questions:

  1. Is there a more efficient way to store the integers? I need 195 million integers in some format.

  2. Is there a method for writing it to a file? I have tried write.csv, but it crashes due to the size.

Tom
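
A note on the numbers in the question: 195 million values at 8 bytes each comes to about 1.56 GB, which suggests (but does not prove) that the result is being stored as a double-precision `numeric` vector. The sketch below compares the per-element cost of `numeric`, `integer`, and `raw` storage on a small stand-in vector. The data, the variable names, and the 0 to 255 value range are all assumptions made for illustration; `raw` only applies if every computed value really fits in one byte.

```r
# Small stand-in for the 195-million-element result (hypothetical data).
n <- 1e6
x_num <- as.numeric(sample(0:200, n, replace = TRUE))  # double: 8 bytes per element
x_int <- as.integer(x_num)                             # integer: 4 bytes per element

# raw storage is only valid if every value lies in 0..255
# (an assumption about the scores, checked here).
stopifnot(all(x_int >= 0L & x_int <= 255L))
x_raw <- as.raw(x_int)                                 # raw: 1 byte per element

# Total object sizes in bytes, then approximate bytes per element.
sizes <- c(numeric = as.numeric(object.size(x_num)),
           integer = as.numeric(object.size(x_int)),
           raw     = as.numeric(object.size(x_raw)))
print(sizes / n)

# Round trip: the raw vector converts back to the same integers.
stopifnot(identical(as.integer(x_raw), x_int))
```

If the values do not fit in one byte, `integer` storage still halves the footprint relative to `numeric`.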
  • Have you tried using the tools available from [Bioconductor](https://www.bioconductor.org/)? – Rich Scriven Nov 25 '15 at 18:46
  • Yes, the DNA string is processed using Biostrings from Bioconductor. I'm looking for a similar solution for the integers as part of my analysis, if possible! – Tom Nov 25 '15 at 18:52
  • The `integer` class is half the size of `numeric`. So instead of 8x more memory, you would be using 4x more memory. – fishtank Nov 25 '15 at 19:22
  • Try `gzfile`: http://stackoverflow.com/questions/17492409/how-to-directly-perform-write-csv-in-r-into-tar-gz-format – fishtank Nov 25 '15 at 19:27
  • For a small range of integer values, two-character representations might offer compactness via hexmode: `as.integer(as.hexmode('ff'))` returns `255`. – IRTFM Nov 25 '15 at 21:08
  • Try to solve your problem from a biological viewpoint: divide your chromosomes into loci (chromosome regions). They are well defined and reduce the number of nucleotides per sequence. If the sequences are still too long (they probably will be), you can subdivide them once more. – Robert Kirsten Nov 26 '15 at 17:52
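
The `gzfile` suggestion in the comments can be sketched in two ways: compressed text with one integer per line, or a compressed binary dump via `writeBin`. This is a minimal sketch on a ten-element vector, not a tested recipe for 195 million values; the vector contents and the temporary file paths are hypothetical.

```r
x <- as.integer(c(3, 0, 7, 7, 1, 2, 0, 5, 9, 4))  # hypothetical per-letter values

# Option A: gzip-compressed text, one integer per line.
txt_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(txt_path, open = "wt")
writeLines(as.character(x), con)
close(con)

con <- gzfile(txt_path, open = "rt")
x_txt <- scan(con, what = integer(), quiet = TRUE)
close(con)

# Option B: gzip-compressed binary, 4 bytes per integer before compression.
bin_path <- tempfile(fileext = ".bin.gz")
con <- gzfile(bin_path, open = "wb")
writeBin(x, con, size = 4L)
close(con)

con <- gzfile(bin_path, open = "rb")
x_bin <- readBin(con, what = "integer", n = length(x), size = 4L)
close(con)

# Both files reproduce the original vector.
stopifnot(identical(x_txt, x), identical(x_bin, x))
```

The binary route avoids building a character copy of the whole vector, which matters at this scale. For the text route on the full chromosome, writing the vector in chunks to the same open connection would likely keep peak memory lower.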

0 Answers