2

How can I take 1 million substrings from a string with more than 3 million characters efficiently in C#?

I have written a program that reads random DNA reads (substrings starting at random positions) of length, say, 100 from a string of 3 million characters. There are 1 million such reads. Currently I run a while loop 1 million times, each time reading a 100-character substring from the 3-million-character string. This takes a long time. What can I do to make it faster?

Here's my code. `len` is the length of the original string: 3 million in this case, but it may be as low as 50, which is why there is a check in the while loop.

while (i < 1000000 && len - 100 > 0) // len is 3000000
{
    int randomPos = _random.Next() % (len - ReadLength);
    readString += all.Substring(randomPos, ReadLength) + Environment.NewLine;
    i++;
}
P basak
  • How often do you switch DNA strands and read a new one? Is there a set number of total DNA strands you're reading? – Jason Mar 21 '12 at 09:34
  • How about making DNA smaller, Maybe 1 byte in length? :-D – Rohit Vipin Mathews Mar 21 '12 at 09:35
  • Would applying multi-threading work? – Nick Mar 21 '12 at 09:37
  • You read from the DNA string at random positions? So you don't need to read the string to determine the substring? – Slugart Mar 21 '12 at 09:37
  • You could probably benefit by using a more space-efficient type than `String` to start with (do you really need 2 bytes for every position?), but I suspect that what's really slowing you down is the line `readString += ...`, which is allocating a million new strings and probably causing the garbage collector to froth at the mouth. Instead of using a `String` for `readString`, use `StringBuilder readString = new StringBuilder(ReadLength * numSubstrings);`, then each time through the loop, `readString.AppendLine(all.Substring...);`. – anton.burger Mar 21 '12 at 09:46
  • Thanks a million. I used this and it completed in a second. Great man. Please add it to the answer section. Thanks. – P basak Mar 21 '12 at 10:00

4 Answers

2

Using a StringBuilder to assemble the string will give you about a 600-fold speedup, as it avoids creating a new string object every time you append.

before loop (initialising capacity avoids recreating the backing array in StringBuilder):

StringBuilder sb = new StringBuilder(1000000 * ReadLength);

in loop:

sb.Append(all.Substring(randomPos, ReadLength) + Environment.NewLine);

after loop:

readString = sb.ToString();

Using a char array instead of a string to extract the values yields another 30% improvement, as you avoid the object creation incurred by calling Substring():

before loop:

char[] chars = all.ToCharArray();

in loop:

sb.Append(chars, randomPos, ReadLength);
sb.AppendLine();
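
Putting the two snippets above together, a self-contained sketch might look like the following. The names `ReadSampler` and `SampleReads` and the explicit `seed` parameter are illustrative assumptions, not part of the original code; the sketch also uses `Random.Next(int)` rather than the modulo from the question.

```csharp
using System;
using System.Text;

public static class ReadSampler
{
    // Hypothetical helper: draws `count` reads of `readLength` chars from
    // random positions in `source` and returns them one per line.
    public static string SampleReads(string source, int count, int readLength, int seed)
    {
        int maxStart = source.Length - readLength;
        if (maxStart < 0) return string.Empty; // source is shorter than one read

        var random = new Random(seed);
        char[] chars = source.ToCharArray();

        // Pre-size the builder so its backing storage is not regrown in the loop.
        int lineLength = readLength + Environment.NewLine.Length;
        var sb = new StringBuilder(count * lineLength);

        for (int i = 0; i < count; i++)
        {
            // Next(n) returns a value in [0, n), so maxStart itself is a valid start.
            int start = random.Next(maxStart + 1);
            sb.Append(chars, start, readLength); // copies straight from the array, no Substring
            sb.AppendLine();
        }
        return sb.ToString();
    }
}
```

Calling `ReadSampler.SampleReads(all, 1000000, ReadLength, 42)` would then stand in for the whole loop from the question.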

Edit (final version which does not use StringBuilder and executes in 300ms):

char[] chars = all.ToCharArray();
var iterations = 1000000;
// One extra slot per read holds the line terminator.
char[] results = new char[iterations * (ReadLength + 1)];
GetRandomStrings(len, iterations, ReadLength, chars, results, 0);
string s = new string(results);

private static void GetRandomStrings(int len, int iterations, int ReadLength, char[] chars, char[] result, int resultIndex)
{
    Random random = new Random();
    int index = resultIndex;
    // Number of valid start positions; computed once, outside the loop.
    int maxStart = len - ReadLength;
    if (maxStart <= 0) return; // input too short for a read (len may be as low as 50)

    for (int i = 0; i < iterations; i++) // len is 3000000
    {
        int randomPos = random.Next(maxStart);

        Array.Copy(chars, randomPos, result, index, ReadLength);
        index += ReadLength;
        // Only one slot is reserved per terminator, so write '\n' explicitly;
        // Environment.NewLine[0] would be a lone '\r' on Windows.
        result[index] = '\n';
        index++;
    }
}
Slugart
  • Yes, I found that StringBuilder is the best option. I will also consider the char array option. – P basak Mar 21 '12 at 10:06
  • Hi, I used your final version. It seems to me that it cannot be done any faster. :) – P basak Mar 21 '12 at 11:11
  • Does that final version definitely behave correctly? `Buffer.BlockCopy` operates on *byte* offsets and lengths, not *array element* indices and lengths. With a `char` taking 2 bytes, (1) any time `randomPos` is odd, you're going to start splicing pieces of adjacent characters together, and (2) `ReadLength == X` will get you `X` bytes but only `X / 2` characters. Possibly better to use `Array.Copy` instead. – anton.burger Mar 21 '12 at 16:53
  • Mine gets all strings in 43ms, and adding randomizing will not change that by more than 10ms. – SimpleVar Mar 22 '12 at 01:27
  • @Anton you're absolutely right, chars are 2 bytes wide so the offsets will be off using Buffer.BlockCopy - I've edited to use Array.Copy instead. – Slugart Mar 22 '12 at 10:30
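
The byte-offset pitfall raised in the comments above can be illustrated with a small sketch. `CopyDemo` and its method names are hypothetical; the point is only that `Buffer.BlockCopy` takes byte offsets and a byte count, so element indices have to be scaled by `sizeof(char)`, while `Array.Copy` works in element indices directly.

```csharp
using System;

public static class CopyDemo
{
    // Copies `count` chars starting at element index `start` using Array.Copy,
    // which takes element indices and an element count.
    public static char[] WithArrayCopy(char[] source, int start, int count)
    {
        var dest = new char[count];
        Array.Copy(source, start, dest, 0, count);
        return dest;
    }

    // The same copy using Buffer.BlockCopy: offsets and length are in bytes,
    // so each value is multiplied by sizeof(char), which is 2.
    public static char[] WithBlockCopy(char[] source, int start, int count)
    {
        var dest = new char[count];
        Buffer.BlockCopy(source, start * sizeof(char), dest, 0, count * sizeof(char));
        return dest;
    }
}
```

With the scaling in place both methods return the same characters; without it, `Buffer.BlockCopy` would start at the wrong position and copy only half as many characters as intended.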
1

I think better solutions will come, but for repeated appends the .NET StringBuilder class is faster than String, because it appends into a mutable buffer instead of allocating a new string on every concatenation.

You can split the data into pieces and use the .NET Task Parallel Library for multithreading and parallelism.

Edit: Assign fixed values to variables outside the loop to avoid recalculating them:

int x = len - 100;
int y = len - ReadLength;

use

StringBuilder readString= new StringBuilder(ReadLength * numberOfSubStrings);
readString.AppendLine(all.Substring(randomPos, ReadLength));

For parallelism you should split your input into pieces, run these operations on the pieces in separate threads, and then combine the results.

Important: In my previous experience these operations ran faster with .NET v2.0 than with v4.0, so you may want to change your project's target framework version. However, the Task Parallel Library is not available in .NET v2.0, so you would have to use multithreading the old-school way, like

Thread newThread ......
qwerty
  • Hi, can you please provide some examples? – P basak Mar 21 '12 at 09:40
  • Added some samples and more explanation – qwerty Mar 21 '12 at 10:04
  • Hi, using StringBuilder gave me the necessary performance gain. The job completed in seconds! – P basak Mar 21 '12 at 10:10
  • You would need to be careful in parallelising this code, as it uses the Random class, which is not thread-safe. If you were to go down this path you should look at the solutions suggested here: http://blogs.msdn.com/b/pfxteam/archive/2009/02/19/9434171.aspx – Slugart Mar 21 '12 at 10:12
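
One way to follow the thread-safety advice in the comment above is to give each worker its own `Random` instance. The following is a minimal sketch under stated assumptions, not a tuned implementation: `ParallelReadSampler`, `SampleReads` and `baseSeed` are hypothetical names, and it requires .NET 4.0 or later for `Parallel.For`. Each iteration writes to its own slice of the result array, so no locking is needed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelReadSampler
{
    // Hypothetical helper: fills a char buffer with `count` random reads,
    // each followed by a single '\n'.
    public static char[] SampleReads(char[] source, int count, int readLength, int baseSeed)
    {
        int maxStart = source.Length - readLength;
        if (maxStart < 0) return new char[0]; // source is shorter than one read

        int slot = readLength + 1; // one read plus its '\n'
        var result = new char[count * slot];
        int seed = baseSeed;

        Parallel.For<Random>(0, count,
            // Thread-local state: one Random per worker, each with a distinct seed,
            // so no Random instance is ever shared between threads.
            () => new Random(Interlocked.Increment(ref seed)),
            (i, loopState, rng) =>
            {
                int start = rng.Next(maxStart + 1);
                int offset = i * slot; // iteration i owns result[offset .. offset + slot)
                Array.Copy(source, start, result, offset, readLength);
                result[offset + readLength] = '\n';
                return rng;
            },
            rng => { }); // nothing to clean up per worker

        return result;
    }
}
```

Whether this beats the single-threaded char array version depends on the machine; the copy loop is mostly memory-bound, so the gain may be modest.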
0

Edit: I abandoned the idea of using memcpy, and I think the result is great. I've broken a 3-million-character string into 30k strings of length 100 each in 43 milliseconds.

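// Requires "using System.Runtime.InteropServices;" and compiling with unsafe code enabled.
private static unsafe string[] Scan(string hugeString, int subStringSize)
{
    var results = new string[hugeString.Length / subStringSize];

    // Pin the string so the GC cannot move it while it is read through a raw pointer.
    var gcHandle = GCHandle.Alloc(hugeString, GCHandleType.Pinned);
    try
    {
        var currAddress = (char*)gcHandle.AddrOfPinnedObject();

        for (var i = 0; i < results.Length; i++)
        {
            results[i] = new string(currAddress, 0, subStringSize);
            currAddress += subStringSize;
        }
    }
    finally
    {
        // Release the handle; otherwise the string stays pinned and is never collected.
        gcHandle.Free();
    }

    return results;
}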

To use the method for the case shown in the question:

const int size = 3000000;
const int subSize = 100;

var stringBuilder = new StringBuilder(size);
var random = new Random();

for (var i = 0; i < size; i++)
{
    stringBuilder.Append((char)random.Next(30, 80));
}

var hugeString = stringBuilder.ToString();

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
{
    var strings = Scan(hugeString, subSize);
}
stopwatch.Stop();

Console.WriteLine(stopwatch.ElapsedMilliseconds / 1000); // 43
SimpleVar
0

How long is a long time? It shouldn't take that long.

var file = new StreamReader(@"E:\Temp\temp.txt");
var s = file.ReadToEnd();
var r = new Random();
var sw = new Stopwatch();
sw.Start();
var range = Enumerable.Range(0,1000000);
var results = range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToList();
sw.Stop();
sw.ElapsedMilliseconds.Dump();
s.Length.Dump();

So on my machine the results were 807ms and the string is 4,055,442 chars.

Edit: I just noticed that you want a string as a result, so my above solution just changes to...

var results = string.Join(Environment.NewLine,range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToArray());

And adds about 100ms, so still under a second in total.

Tim Jarvis
  • Great, this solution is super fast. – P basak Mar 21 '12 at 10:27
  • If speed is your only criterion, you will find that this method is about three times slower than using char arrays and StringBuilder. I have to say it is elegant and concise code though. – Slugart Mar 21 '12 at 10:56