73

Introduction

I have this favorite algorithm that I've made quite some time ago which I'm always writing and re-writing in new programming languages, platforms etc. as some sort of benchmark. Although my main programming language is C# I've just quite literally copy-pasted the code and changed the syntax slightly, built it in Java and found it to run 1000x faster.

The Code

There is quite a bit of code but I'm only going to present this snippet which seems to be the main issue:

for (int i = 0; i <= s1.Length; i++) 
{
    for (int j = i + 1; j <= s1.Length - i; j++)
    {
        string _s1 = s1.Substring(i, j);
        if (tree.hasLeaf(_s1))
         ...

The Data

It is important to point out that the string s1 in this particular test is of length 1 milion characters (1MB).

Measurements

I have profiled my code execution in Visual Studio because I thought the way I construct my tree or the way I traverse it isn't optimal. After examining the results it appears that the line string _s1 = s1.Substring(i, j); is accommodating for over 90% of the execution time!

Additional Observations

Another difference that I've noticed is that although my code is single threaded Java manages to execute it using all 8 cores (100% CPU utilization) while even with Parallel.For() and multi threading techniques my C# code manages to utilize 35-40% at most. Since the algorithm scales linearly with the number of cores (and frequency) I have compensated for this and still the snippet in Java executes order of magnitude 100-1000x faster.

Reasoning

I presume that the reason why this is happening has to do with the fact that strings in C# are immutable so String.Substring() has to create a copy and since it's within a nested for loop with many iterations I presume a lot of copying and garbage collecting is going on, however, I don't know how Substring is implemented in Java.

Question

What are my options at this point? There is no way around the number and length of substrings (this is already optimized maximally). Is there a method that I don't know of (or data structure perhaps) that could solve this issue for me?

Requested Minimal Implementation (from comments)

I have left out the implementation of the suffix tree which is O(n) in construction and O(log(n)) in traversal

public static double compute(string s1, string s2)
{
    double score = 0.00;
    suffixTree stree = new suffixTree(s2);
    for (int i = 0; i <= s1.Length; i++) 
    {
        int longest = 0;
        for (int j = i + 1; j <= s1.Length - i; j++)
        {
            string _s1 = s1.Substring(i, j);
            if (stree.has(_s1))
            {
                score += j - i;
                longest = j - i;
            }
            else break;
         };

        i += longest;
    };
    return score;
}

Screenshot snippet of the profiler

Note this was tested with string s1 the size of 300.000 characters. For some reason 1 milion characters just never finishes in C# while in Java it takes only 0.75 seconds.. The memory consumed and number of garbage collections don't seem to indicate a memory issue. The peak was about 400 MB but considering the huge suffix tree this appears to be normal. No weird garbage collecting patterns spotted either.

CPU profiler

Memory profiler

Ilhan
  • 1,309
  • 10
  • 24
  • 5
    `String` in Java is immutable as well. Have you tried `StringBuilder` instead? – Dan Aug 03 '18 at 13:25
  • I forgot to mention StringBuilder. I tried using it and generating substrings by hand (with both classic loops and LINQ) but as soon as I do .ToString() it slows down as much as String.Substring(); – Ilhan Aug 03 '18 at 13:28
  • Have you looked into converting to a `Stream`. Your current implementation is doing a double loop O(n^2), which isn't optimal. – Dan Aug 03 '18 at 13:28
  • The inner loop has an exit condition (break) which makes it O(n log n) in general. I haven't tried using streams for that matter though – Ilhan Aug 03 '18 at 13:28
  • @llhan: can't your `stree.has(..)` operate over range of indices of, "slice", of he original string, rather than creating a whole new object every time ? Like -`if (stree.has(s1, i, j))` – Tigran Aug 03 '18 at 13:37
  • 1
    I guess you have a memory issue. Have you looked at it? – Tim Schmelter Aug 03 '18 at 13:38
  • @Tigran please reload the page I have provided the screenshot from the profiler. There can't be anything wrong with "stree.has()" – Ilhan Aug 03 '18 at 13:43
  • @TimSchmelter The program doesn't seem to consume an abnormal number of memory, I've presented the memory profiler screenshot in the edit of the post – Ilhan Aug 03 '18 at 13:46
  • This code doesn't make a lot of sense. I think you have an array of nodes, not strings. You c# code has a string : n1n2n3n" You are looking for leaf "n1n2n3n","1n2n3","n2n3",2n3","n3","n","1n2n3n","n2n3",2n3","n3","3","n2n3n","2n3n","n3n","3n","n","2n3n","n3n","3n","n","n3n","3n","n","3n","n","n". – jdweng Aug 03 '18 at 13:50
  • @llhan: it's not about `has(..)` being wrong, it's about do not have to run substring at all, but just get indeces `from` -> `to` of the original string, and find a key using a like a key a range of symbols from *original* string directly. – Tigran Aug 03 '18 at 13:50
  • Hmm that seems interesting I will give it a try using a string builder, declaring the strings static and accessable from within the tree and pass in only the indexes and see what happens – Ilhan Aug 03 '18 at 13:52
  • However I'm not sure how to implement the latter thing you wrote without actually copying the content and constructing a new string by my self. I would know how to pull it off in C++ though @Tigran – Ilhan Aug 03 '18 at 13:55
  • @jdweng A suffix tree is a tree made of suffixes implemented as a Dictionary with construction time O(n) (Ukkoken's Algorithm) and traversal O(log(n)) and is a data structure for indexing big strings where classical Boyer Moore Horspol string.Contains() is inefficient due to the number of searches being done – Ilhan Aug 03 '18 at 13:59
  • Ok. Where is you suffix tree? Why are you doing a leaf test for "n" 5 times in the example I posted? – jdweng Aug 03 '18 at 14:10
  • For "ABC" it cycles A, AB, ABC, B, BC, C. (actually it does ABC first then AB then A because of how string.Substring() works in C# it was easier to write it like that) and checks if the second string s2 contains these substrings using the suffixtree rather than string.Contains(). It never repeats the same character twice. However if you have say "AAC" it will check for A twice. This is a string similarity metric it computes the frequency of occurring weighted by the length of the substring – Ilhan Aug 03 '18 at 14:16
  • 2
    Seven of those eight cores in Java are probably used for garbage collecting your substrings :) – hoodaticus Aug 03 '18 at 14:18
  • 1
    Haha that might be it.. :'). Do you have any idea syntaxically how I could get my substrings without making copies all the time in C#? I can't just const char*& and use pointer arithmetic like in C++.. – Ilhan Aug 03 '18 at 14:21
  • @Ilhan - the optimal way of doing this in C++ is also the optimal way of doing it in C#. You can use char* in C#, just decorate the class with unsafe. – hoodaticus Aug 03 '18 at 14:21
  • The main mystery which I dont know how to solve is how to allocate memory for char* in C# on the heap within unsafe block? I guess I could put the whole function in unsafe {} – Ilhan Aug 03 '18 at 14:35
  • 1
    just create function like `if (stree.has(string, indexStart, indexEnd))` and never actually do any substring operations, just check it manually char by char. Or instead of adding such dirty method create object that would represent string but with movable view, so object that store string, and start/end index (probably mutable if you want better performance) and use it in that method instead of String. (also without using substring) – GotoFinal Aug 03 '18 at 15:14
  • 4
    Until C# gets `Span`, as other commenters pointed out simply use `(string, startIndex, endIndex)` in methods like `stree.has`. Inside the methods use string indexer (`s[i]`) which returns `char` w/o allocation. – Ivan Stoev Aug 03 '18 at 17:51

1 Answers1

87

Issue Origin

After having a glorious battle that lasted two days and three nights (and amazing ideas and thoughts from the comments) I've finally managed to fix this issue!

I'd like to post an answer for anybody running into similar issues where the string.Substring(i, j) function is not an acceptable solution to get the substring of a string because the string is either too large and you can't afford the copying done by string.Substring(i, j) (it has to make a copy because C# strings are immutable, no way around it) or the string.Substring(i, j) is being called a huge number of times over the same string (like in my nested for loops) giving the garbage collector a hard time, or as in my case both!

Attempts

I've tried many suggested things such as the StringBuilder, Streams, unmanaged memory allocation using Intptr and Marshal within the unsafe{} block and even creating an IEnumerable and yield return the characters by reference within the given positions. All of these attempts failed ultimatively because some form of joining of the data had to be done as there was no easy way for me to traverse my tree character by character without jeopardizing performance. If only there was a way to span over multiple memory addresses within an array at once like you would be able to in C++ with some pointer arithmetic.. except there is.. (credits to @Ivan Stoev's comment)

The Solution

The solution was using System.ReadOnlySpan<T> (couldn't be System.Span<T> due to strings being immutable) which, among other things, allows us to read sub arrays of memory addresses within an existing array without creating copies.

This piece of the code posted:

string _s1 = s1.Substring(i, j);
if (stree.has(_s1))
{
    score += j - i;
    longest = j - i;
}

Was changed to the following:

if (stree.has(i, j))
{
    score += j - i;
    longest = j - i;
}

Where stree.has() now takes two integers (position and length of substring) and does:

ReadOnlySpan<char> substr = s1.AsSpan(i, j);

Notice that the substr variable is literally a reference to a subset of characters of the initial s1 array and not a copy! (The s1 variable had been made accessible from this function)

Note that at the moment of writing this I am using C#7.2 and .NET Framework 4.6.1 meaning that to get the Span feature I had to go to Project > Manage NuGet Packages, tick the "Include prerelease" checkbox and browse for System.Memory and install it.

Re-running the initial test (on strings of length 1 milion characters i.e. 1MB) the speed increased from 2+ minutes (I gave up waiting after 2 minutes) to ~86 miliseconds!!

Ilhan
  • 1,309
  • 10
  • 24
  • 2
    Can slice as part of creating the Span: `s1.AsSpan(i, j)`, should be a little faster? – Ben Adams Aug 12 '18 at 16:14
  • It could be since I don't know how exactly span is implemented. It doesn't appear to be faster but it is intuitive to think that it is .. at least I think so. I will edit my post and use your suggestion as it is probably the intended way to use span @BenAdams – Ilhan Aug 12 '18 at 16:21
  • 3
    More informations about Span, if you are interested. (Just for completeness) https://msdn.microsoft.com/en-us/magazine/mt814808.aspx – El Mac Aug 13 '18 at 11:26