
I'm trying to make a DNA analysis tool, but I'm facing a big problem.

Here's a screenshot of how the application looks: [screenshot omitted]

The problem I'm facing is handling large data. I've used streams and memory-mapped files, but I'm not really sure I'm heading in the right direction. What I'm trying to achieve is to write a text file with 3 billion random letters, and then use that file for later purposes. Currently I'm at 3,000 letters, but generating more than that takes ages. How would you tackle this? Storing the full text file in a string seems like overkill to me. Any ideas?

    private void WriteDNASequence(string dnaFile)
    {
        // Map 0..3 to the four nucleotide letters.
        var nucleotides = new Dictionary<int, char>
        {
            { 0, 'A' },
            { 1, 'T' },
            { 2, 'C' },
            { 3, 'G' }
        };

        int basePairs = 3000;

        using (StreamWriter sw = new StreamWriter(Path.Combine(filepath, dnaFile)))
        {
            // Writes one random letter per iteration, basePairs / 2 letters in total.
            for (int i = 0; i < (basePairs / 2); i++)
            {
                sw.Write(nucleotides[RandomNumber(0, 4)]);
            }
        }
    }

    private string ReadDNASequence(string dnaFile)
    {
        // Reads the whole file into one string - fine at 3,000 letters,
        // but not viable for billions of characters.
        using (StreamReader file = new StreamReader(Path.Combine(filepath, dnaFile)))
        {
            _DNAData = file.ReadToEnd();
        }
        return _DNAData;
    }
    // Helper to get a random number (thread-safe via the lock below)
    private static readonly Random random = new Random();
    private static readonly object syncLock = new object();
    public static int RandomNumber(int min, int max)
    {
        lock (syncLock) // synchronize access to the shared Random instance
        {
            return random.Next(min, max);
        }
    }
Hades
  • Generate it in chunks and then write each chunk to the file. There should be no problem generating 500,000 letters in one string (use a StringBuilder, please) and then flushing that to a file. Rinse, repeat 20 times (see the sketch after these comments). – Ron Beyer Mar 27 '18 at 12:46
  • Writing text should be about the easiest thing there is (but you've managed to make it very complex). I can see a problem with reading it all, but that's not what's being asked here. – H H Mar 27 '18 at 12:47
  • If you have one process that writes the file sequentially, and then one process that reads it sequentially, there is no need for memory-mapped files. Just do it. You will probably need to buffer your data; reading or writing one byte at a time is time-consuming. You may also want to use an ANSI one-byte-per-char file if you only have normal ANSI characters. Maybe even pack them into something smaller, if you don't have 256 possible values per byte. – nvoigt Mar 27 '18 at 12:47
  • Use a database like SQL Server, which is designed to handle huge amounts of data. – jdweng Mar 27 '18 at 12:49
  • It's probably not safe to write to the same view accessor from multiple threads by the way. – Evk Mar 27 '18 at 12:52
  • I've updated my code – Hades Mar 27 '18 at 13:01
  • So how long does it take now to generate file with 3 billion pairs? And how much do you want? – Evk Mar 27 '18 at 13:14
  • Evk, it takes way too long. – Hades Mar 27 '18 at 13:15
  • I'm still looking for alternatives – Hades Mar 27 '18 at 13:15
  • How long? 100 seconds, 200? You need to have some goal (like "if it will take 10 seconds - I'm fine with that"). – Evk Mar 27 '18 at 13:19
  • Asking that because I can write an answer which can do that in about 10 seconds on my ancient HDD drive, but I have no idea whether that's still too slow for you or not. – Evk Mar 27 '18 at 13:44
  • It does 300,000,000 in 20 seconds. That's 300M; I need to get to 3 billion. – Hades Mar 27 '18 at 13:49
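A minimal sketch of the chunked StringBuilder approach Ron Beyer describes (the method name, parameters, and chunk size are illustrative, not from the question; assumes `using System;`, `using System.IO;` and `using System.Text;`):

private static void WriteDnaInChunks(string path, long totalLetters)
{
    char[] letters = { 'A', 'T', 'C', 'G' };
    var random = new Random();
    const int chunkSize = 500_000; // letters per flush, per the comment above

    using (var sw = new StreamWriter(path))
    {
        var sb = new StringBuilder(chunkSize);
        long written = 0;
        while (written < totalLetters)
        {
            sb.Clear();
            int thisChunk = (int)Math.Min(chunkSize, totalLetters - written);
            for (int i = 0; i < thisChunk; i++)
                sb.Append(letters[random.Next(4)]);
            sw.Write(sb.ToString()); // flush one chunk to disk
            written += thisChunk;
        }
    }
}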

1 Answer


When working with such big amounts of data, every bit matters, and you have to pack the data as densely as possible.

As of now, each nucleotide is represented by one char, and one char in the encoding you use (UTF-8 by default) takes 1 byte (for the 4 characters you use).

But since you have just 4 different characters, each character holds only 2 bits of information, so we can represent them as:

00 - A
01 - T
10 - C
11 - G

That means we can pack 4 nucleotides into one byte, making the output file 4 times smaller.

Assuming you have maps like these:

static readonly Dictionary<char, byte> _nucleotides = new Dictionary<char, byte> {
    { 'A', 0 },
    { 'T', 1 },
    { 'C', 2 },
    { 'G', 3 }
};
static readonly Dictionary<int, char> _reverseNucleotides = new Dictionary<int, char> {
    { 0, 'A' },
    { 1, 'T' },
    { 2, 'C' },
    { 3, 'G' }
};

You can pack 4 nucleotides into one byte like this:

string toPack = "ATCG";
byte packed = 0;
for (int i = 0; i < 4; i++) {
    // Shift each 2-bit code into place: bits 0-1, 2-3, 4-5, 6-7.
    packed = (byte) (packed | _nucleotides[toPack[i]] << (i * 2));
}

And unpack it back like this:

string unpacked = new string(new[] {
    _reverseNucleotides[packed & 0b11],
    _reverseNucleotides[(packed & 0b1100) >> 2],
    _reverseNucleotides[(packed & 0b110000) >> 4],
    _reverseNucleotides[(packed & 0b11000000) >> 6],
});
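A quick round-trip check of the two snippets above (the printed values follow from the mapping, assuming `packed` and `unpacked` from the snippets are in scope):

// "ATCG" packs to 0b11100100: G=11, C=10, T=01, A=00, low bits first.
Console.WriteLine(Convert.ToString(packed, 2).PadLeft(8, '0')); // 11100100
Console.WriteLine(unpacked); // ATCG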

As for writing bytes to a file, that's easy enough. If you just need random data in this case, you can fill a whole buffer at once:

int chunkSize = 1024 * 1024; // 1 MiB of packed data = ~4.2 million nucleotides at once (4 per byte)
byte[] chunk = new byte[chunkSize];
random.NextBytes(chunk); // fill the whole buffer with random bytes in one call
// fileStream is an instance of `FileStream`; no need for `StreamWriter` when writing raw bytes
fileStream.Write(chunk, 0, chunk.Length);

There are some caveats (for example, the last byte in the file might hold fewer than 4 nucleotides), but I hope you'll figure those out yourself.

With that approach (packing into binary, generating one big random chunk at a time, writing big chunks to the file), generating 3 billion pairs took 8 seconds on my 7-year-old HDD, and the output file is 350MB. You can even read all 350MB into memory at once if necessary.

Evk
  • The random byte packing using all possible bits, and the resulting use of NextBytes, is great. In my tests, producing random integers from 0-3 to get single bytes takes longer than actually writing them to disk. – nvoigt Mar 27 '18 at 15:26
  • My vote for a very inventive solution. Running this on my machine took about 4.5 seconds. – Ron Beyer Mar 27 '18 at 15:34
  • Hey, fantastic answer! One question, can you show how you generated the 3 billion pairs using the chunkSize? – Hades Mar 28 '18 at 10:12
  • @Eli I don't have time right now to write that code, but it should be easy enough. 3 billion pairs is 1.5 billion characters as I understand it. Each 4 characters is 1 byte, so divide that by 4. Then generate random bytes as described above, in chunks, and write them to the file (also as described) until you've written the required number of bytes (the last chunk might be smaller than your chunk size, of course; a sketch follows below). – Evk Mar 28 '18 at 10:20
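A sketch of the loop Evk describes in that comment (the method name and the header-free file layout are my assumptions, not from the answer; assumes `using System;` and `using System.IO;`):

static void WritePackedDna(string path, long totalNucleotides)
{
    const int chunkSize = 1024 * 1024;            // 1 MiB per write
    long totalBytes = (totalNucleotides + 3) / 4; // 4 nucleotides per byte, rounded up
    var random = new Random();
    byte[] chunk = new byte[chunkSize];

    using (var fileStream = new FileStream(path, FileMode.Create))
    {
        long bytesLeft = totalBytes;
        while (bytesLeft > 0)
        {
            int thisChunk = (int)Math.Min(chunkSize, bytesLeft);
            random.NextBytes(chunk); // random bytes are already random packed nucleotides
            fileStream.Write(chunk, 0, thisChunk);
            bytesLeft -= thisChunk;
        }
    }
    // If totalNucleotides % 4 != 0, the last byte holds fewer than 4
    // meaningful nucleotides; the real count must be stored separately.
}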