How can I count the occurrences of a certain character combination in a text file in the fastest possible way?

Question

I would like to write a C# method that counts the occurrences of a sequence of characters in a .txt file. I've found some related questions here that have valid answers. However, two constraints restrict the possible solutions:

The method has to be quite fast, because I have to call it hundreds of times in the program.
The text in the file is too long to be read into a string.

Thank you for your help.

Just read the file in batches your program can handle, and then process each batch. Edge case: keep reading each batch until hitting a proper word boundary. — Tim Biegeleisen, Aug 23 '18 at 09:06
Thanks. Does it mean there are no methods to read the next 2/3/X characters in a StreamReader without consuming them? — Alkadikce, Aug 23 '18 at 09:16
What's wrong with consuming them, assuming you just want a count? — Tim Biegeleisen, Aug 23 '18 at 09:16
Did you even try anything? We're definitely not doing your job here, which is **thinking**, **trying**, **thinking again**. — MakePeaceGreatAgain, Aug 23 '18 at 09:19
I looked through all the StreamReader methods, and there weren't any for moving the cursor back. — Alkadikce, Aug 23 '18 at 09:21
"The text in the file is too long to be read into a string" Hmm? A single string can hold some tens of thousands of characters. It's only limited by the per-object size limit of your app, which is usually 2GB for 32bit-system. So unless you provide an actual **problem** that shows what you've tried already and what results you expect, there's not much we can do here. In particular, please read [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). — MakePeaceGreatAgain, Aug 23 '18 at 09:21
Yes, the text is a series of long novels, it's more than a hundred thousand characters. — Alkadikce, Aug 23 '18 at 09:24
I can solve the problem itself; my problem is just with the length of the text and the time it takes. Otherwise I know how to count occurrences of a character sequence. — Alkadikce, Aug 23 '18 at 10:22
*it's more than a hundred thousand characters*. Quite a small amount of text. You just need Linq for this. No parallel processing required. Use the cached methods. — Jimi, Aug 23 '18 at 11:13

Adam Simon · Answer 1 · 2018-08-23T12:59:54.490

The method has to be quite fast, because I have to call it hundreds of times in the program.

According to recent benchmarks, SequenceEqual of Span<T> tends to be the fastest way to compare array slices in .NET nowadays (except for unsafe or P/Invoke approaches).
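
As a minimal sketch of what such a slice comparison looks like (the class and method names here are hypothetical, chosen only for illustration), the following checks whether a pattern occurs at a given offset without copying any data:

```csharp
using System;

static class SliceComparisonSketch
{
    // Returns true when `pattern` occurs in `data` starting at `offset`.
    public static bool MatchesAt(ReadOnlySpan<byte> data, int offset, ReadOnlySpan<byte> pattern)
    {
        if (offset < 0 || offset > data.Length - pattern.Length)
            return false;

        // Slice creates a view over the same memory; nothing is copied.
        return data.Slice(offset, pattern.Length).SequenceEqual(pattern);
    }

    static void Main()
    {
        byte[] data = { 1, 2, 3, 4, 5 };
        byte[] pattern = { 3, 4 };

        Console.WriteLine(MatchesAt(data, 2, pattern)); // True
        Console.WriteLine(MatchesAt(data, 1, pattern)); // False
    }
}
```

The arrays convert implicitly to ReadOnlySpan<byte>, so the same call works for slices of a larger reusable buffer.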

The text in the file is too long to be read into a string.

This issue can easily be tackled using FileStream or StreamReader.

In a nutshell, you need to read the file in chunks: read a fixed-size part of the file, look for occurrences in it, read the next part, look for occurrences, and so on. This can be coded without moving the cursor back; only the leftover of each part needs to be taken into account when dealing with the next part.

Here is my approach using FileStream and Span<T>:

public static int CountOccurences(Stream stream, string searchString, Encoding encoding = null, int bufferSize = 4096)
{
    if (stream == null)
        throw new ArgumentNullException(nameof(stream));

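    if (searchString == null)
        throw new ArgumentNullException(nameof(searchString));

    // an empty search string would make the leftover slice below run past the buffer
    if (searchString.Length == 0)
        throw new ArgumentException("Search string must not be empty.", nameof(searchString));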

    if (!stream.CanRead)
        throw new ArgumentException("Stream must be readable.", nameof(stream));

    if (bufferSize <= 0)
        throw new ArgumentException("Buffer size must be a positive number.", nameof(bufferSize));

    // detecting encoding
    Span<byte> bom = stackalloc byte[4];

    var actualLength = stream.Read(bom);
    if (actualLength == 0)
        return 0;

    bom = bom.Slice(0, actualLength);

    Encoding detectedEncoding;
    if (bom.StartsWith(Encoding.UTF8.GetPreamble()))
        detectedEncoding = Encoding.UTF8;
    else if (bom.StartsWith(Encoding.UTF32.GetPreamble()))
        detectedEncoding = Encoding.UTF32;
    else if (bom.StartsWith(Encoding.Unicode.GetPreamble()))
        detectedEncoding = Encoding.Unicode;
    else if (bom.StartsWith(Encoding.BigEndianUnicode.GetPreamble()))
        detectedEncoding = Encoding.BigEndianUnicode;
    else
        detectedEncoding = null;

    if (detectedEncoding != null)
    {
        if (encoding == null)
            encoding = detectedEncoding;

        if (encoding == detectedEncoding)
            bom = bom.Slice(detectedEncoding.GetPreamble().Length);
    }
    else if (encoding == null)
        encoding = Encoding.ASCII;

    // acquiring a buffer
    ReadOnlySpan<byte> searchBytes = encoding.GetBytes(searchString);

    bufferSize = Math.Max(Math.Max(bufferSize, searchBytes.Length), 128);

    var bufferArray = ArrayPool<byte>.Shared.Rent(bufferSize);
    try
    {
        var buffer = new Span<byte>(bufferArray, 0, bufferSize);

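        // Stream.Read may return fewer bytes than requested before the end of the stream,
        // so keep reading until the target span is full or the stream is exhausted.
        int Fill(Span<byte> target)
        {
            var total = 0;
            int read;
            while (total < target.Length && (read = stream.Read(target.Slice(total))) > 0)
                total += read;
            return total;
        }

        // looking for occurrences
        bom.CopyTo(buffer);
        actualLength = bom.Length + Fill(buffer.Slice(bom.Length));
        var occurrences = 0;
        do
        {
            var index = 0;
            var endIndex = actualLength - searchBytes.Length;
            for (; index <= endIndex; index++)
                if (buffer.Slice(index, searchBytes.Length).SequenceEqual(searchBytes))
                    occurrences++;

            // a partially filled buffer means the end of the stream has been reached
            if (actualLength < buffer.Length)
                break;

            // carry over the last (searchBytes.Length - 1) bytes: a match may span two chunks
            ReadOnlySpan<byte> leftover = buffer.Slice(index);
            leftover.CopyTo(buffer);
            actualLength = leftover.Length + Fill(buffer.Slice(leftover.Length));
        }
        while (true);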

        return occurrences;
    }
    finally { ArrayPool<byte>.Shared.Return(bufferArray); }
}

This code requires C# 7.2 or later to compile and uses the System, System.Buffers, System.IO and System.Text namespaces. You may also have to reference the System.Buffers and System.Memory NuGet packages. If you use a .NET Core version lower than 2.1, or a platform other than .NET Core (for example .NET Framework), you also need to include this "polyfill":

static class Compatibility
{
    public static int Read(this Stream stream, Span<byte> buffer)
    {
        // copied over from corefx sources (https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/IO/Stream.cs)
        byte[] sharedBuffer = ArrayPool<byte>.Shared.Rent(buffer.Length);
        try
        {
            int numRead = stream.Read(sharedBuffer, 0, buffer.Length);
            if ((uint)numRead > buffer.Length)
                throw new IOException("Stream was too long.");

            new Span<byte>(sharedBuffer, 0, numRead).CopyTo(buffer);
            return numRead;
        }
        finally { ArrayPool<byte>.Shared.Return(sharedBuffer); }
    }
}

Usage:

using (var fs = new FileStream(@"path-to-file", FileMode.Open, FileAccess.Read, FileShare.Read))
    Console.WriteLine(CountOccurences(fs, "string to search"));

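When you don't specify the encoding argument, the encoding is auto-detected by examining the BOM of the file. If no BOM is present, ASCII is assumed as a fallback. The search string is encoded with the same encoding, so pass the encoding explicitly (for example Encoding.UTF8) when a file without a BOM contains non-ASCII text.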
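
The chunk-and-leftover idea described above can also be applied to decoded characters instead of raw bytes. The following is only a sketch of that variation, not the approach benchmarked above: it assumes a TextReader (such as StreamReader) does the decoding, and the class name CharChunkSketch and its CountOccurrences method are hypothetical names introduced for illustration. On platforms without built-in Span<T> support it would also need the System.Memory package.

```csharp
using System;
using System.IO;

static class CharChunkSketch
{
    // Counts (possibly overlapping) occurrences of `search` in the text read from `reader`.
    public static int CountOccurrences(TextReader reader, string search, int bufferSize = 4096)
    {
        if (reader == null)
            throw new ArgumentNullException(nameof(reader));
        if (string.IsNullOrEmpty(search))
            throw new ArgumentException("Search string must not be empty.", nameof(search));

        // The buffer must hold the carried-over tail plus at least one new character.
        var buffer = new char[Math.Max(bufferSize, search.Length * 2)];
        ReadOnlySpan<char> pattern = search.AsSpan();
        var occurrences = 0;
        var carried = 0; // characters kept from the end of the previous chunk

        while (true)
        {
            // Read returns 0 only at the end of the text; a short read is not the end.
            var read = reader.Read(buffer, carried, buffer.Length - carried);
            if (read == 0)
                break;

            var length = carried + read;
            var lastStart = length - pattern.Length;
            for (var i = 0; i <= lastStart; i++)
                if (buffer.AsSpan(i, pattern.Length).SequenceEqual(pattern))
                    occurrences++;

            // Keep the last (pattern.Length - 1) characters: a match may straddle two chunks.
            carried = Math.Min(pattern.Length - 1, length);
            Array.Copy(buffer, length - carried, buffer, 0, carried);
        }

        return occurrences;
    }

    static void Main()
    {
        // A tiny buffer forces several chunks, so matches across chunk borders are exercised.
        using (var reader = new StringReader("abababab"))
            Console.WriteLine(CountOccurrences(reader, "aba", bufferSize: 4)); // 3
    }
}
```

For a file, the same method can be given a StreamReader opened on the path; StreamReader decodes as UTF-8 by default and honours a BOM if one is present. Working on characters costs a decoding step, but it avoids matching at byte offsets that fall in the middle of a UTF-16 or UTF-32 code unit.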

Thank you very much, I am going to add another comment after reading and interpreting it carefully. — Alkadikce, Aug 24 '18 at 11:22
