> The method has to work quite fast, because I have to use it hundreds of times in the program.
According to recent benchmarks, SequenceEqual of Span<T> tends to be the fastest way to compare array slices in .NET nowadays (except for unsafe or P/Invoke approaches).
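As a rough illustration of that kind of slice comparison, here is a minimal sketch. The `SliceCompareDemo` class and its `MatchesAt` helper are hypothetical names introduced only for this example; the only library calls used are the `ReadOnlySpan<T>` constructor and the `SequenceEqual` extension method.

```csharp
using System;

static class SliceCompareDemo
{
    // Hypothetical helper: returns true when the bytes of `haystack`
    // starting at `offset` are equal to `needle`.
    public static bool MatchesAt(byte[] haystack, int offset, byte[] needle)
    {
        // Reject offsets where the needle would not fit into the haystack.
        if (offset < 0 || offset > haystack.Length - needle.Length)
            return false;

        // The span is only a view over the array slice; no bytes are copied.
        var slice = new ReadOnlySpan<byte>(haystack, offset, needle.Length);
        return slice.SequenceEqual(new ReadOnlySpan<byte>(needle));
    }
}
```

For example, `SliceCompareDemo.MatchesAt(new byte[] { 1, 2, 3, 4 }, 1, new byte[] { 2, 3 })` compares the two-byte slice at offset 1 with the needle without allocating a temporary array.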
> The text in the file is too long to be read into a string.
This issue can easily be tackled using FileStream or StreamReader.
In a nutshell, you need to read the file in chunks: read a fixed-size part of the file, look for occurrences in it, read the next part, look for occurrences, and so on. This can be coded without moving the cursor back; only the leftover of each part (the trailing bytes that could be the start of a match) needs to be carried over into the next part.
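To make the carry-over idea concrete before the byte-level version, here is a simplified sketch that works on characters through a `TextReader` (for example a `StreamReader`). The `ChunkedSearchSketch` class, its `Count` method and the `chunkSize` parameter are names made up for this illustration. It assumes an ordinal, case-sensitive comparison and counts overlapping matches.

```csharp
using System;
using System.IO;

static class ChunkedSearchSketch
{
    // Counts ordinal (possibly overlapping) occurrences of `needle`
    // in the text supplied by `reader`, reading it chunk by chunk.
    public static int Count(TextReader reader, string needle, int chunkSize = 4096)
    {
        if (reader == null)
            throw new ArgumentNullException(nameof(reader));
        if (string.IsNullOrEmpty(needle))
            throw new ArgumentException("Search string must not be empty.", nameof(needle));

        // The buffer must hold the leftover plus at least one new character.
        var buffer = new char[Math.Max(chunkSize, needle.Length * 2)];
        var filled = 0;      // number of valid characters in the buffer
        var occurrences = 0;

        while (true)
        {
            // Append new characters after the leftover of the previous chunk.
            var read = reader.Read(buffer, filled, buffer.Length - filled);
            if (read == 0)
                break;       // end of input
            filled += read;

            // Check every position where the whole needle still fits.
            var lastStart = filled - needle.Length;
            for (var i = 0; i <= lastStart; i++)
                if (buffer.AsSpan(i, needle.Length).SequenceEqual(needle.AsSpan()))
                    occurrences++;

            // Keep the last (needle.Length - 1) characters: a match may
            // straddle the boundary between this chunk and the next one.
            var keep = Math.Min(filled, needle.Length - 1);
            Array.Copy(buffer, filled - keep, buffer, 0, keep);
            filled = keep;
        }

        return occurrences;
    }
}
```

It could be called as `ChunkedSearchSketch.Count(new StreamReader(fileStream), "string to search")`. Decoding to characters costs extra work compared to comparing raw bytes, which is one reason to prefer a byte-level search for large files.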
Here is my approach using FileStream and Span<T>:
    // requires: using System; using System.Buffers; using System.IO; using System.Text;
    public static int CountOccurences(Stream stream, string searchString, Encoding encoding = null, int bufferSize = 4096)
    {
        if (stream == null)
            throw new ArgumentNullException(nameof(stream));
        if (searchString == null)
            throw new ArgumentNullException(nameof(searchString));
        if (searchString.Length == 0)
            throw new ArgumentException("Search string must not be empty.", nameof(searchString));
        if (!stream.CanRead)
            throw new ArgumentException("Stream must be readable.", nameof(stream));
        if (bufferSize <= 0)
            throw new ArgumentException("Buffer size must be a positive number.", nameof(bufferSize));

        // detecting encoding (UTF-32 is checked before UTF-16 because the UTF-16 LE BOM is a prefix of the UTF-32 LE BOM)
        Span<byte> bom = stackalloc byte[4];
        var actualLength = ReadFully(stream, bom);
        if (actualLength == 0)
            return 0;

        bom = bom.Slice(0, actualLength);

        Encoding detectedEncoding;
        if (bom.StartsWith(Encoding.UTF8.GetPreamble()))
            detectedEncoding = Encoding.UTF8;
        else if (bom.StartsWith(Encoding.UTF32.GetPreamble()))
            detectedEncoding = Encoding.UTF32;
        else if (bom.StartsWith(Encoding.Unicode.GetPreamble()))
            detectedEncoding = Encoding.Unicode;
        else if (bom.StartsWith(Encoding.BigEndianUnicode.GetPreamble()))
            detectedEncoding = Encoding.BigEndianUnicode;
        else
            detectedEncoding = null;

        if (detectedEncoding != null)
        {
            if (encoding == null)
                encoding = detectedEncoding;

            // skip the BOM so that only the remaining bytes are searched
            if (encoding == detectedEncoding)
                bom = bom.Slice(detectedEncoding.GetPreamble().Length);
        }
        else if (encoding == null)
            encoding = Encoding.ASCII;

        // acquiring a buffer
        ReadOnlySpan<byte> searchBytes = encoding.GetBytes(searchString);
        bufferSize = Math.Max(Math.Max(bufferSize, searchBytes.Length), 128);

        var bufferArray = ArrayPool<byte>.Shared.Rent(bufferSize);
        try
        {
            // the rented array may be larger than requested, so use only the first bufferSize bytes
            var buffer = new Span<byte>(bufferArray, 0, bufferSize);

            // looking for occurrences
            bom.CopyTo(buffer);
            actualLength = bom.Length + ReadFully(stream, buffer.Slice(bom.Length));
            var occurrences = 0;
            do
            {
                var index = 0;
                var endIndex = actualLength - searchBytes.Length;
                for (; index <= endIndex; index++)
                    if (buffer.Slice(index, searchBytes.Length).SequenceEqual(searchBytes))
                        occurrences++;

                // the buffer could not be filled completely: the end of the stream was reached
                if (actualLength < buffer.Length)
                    break;

                // carry over the last (searchBytes.Length - 1) bytes, as a match may straddle two chunks
                ReadOnlySpan<byte> leftover = buffer.Slice(index);
                leftover.CopyTo(buffer);
                actualLength = leftover.Length + ReadFully(stream, buffer.Slice(leftover.Length));
            }
            while (true);

            return occurrences;
        }
        finally { ArrayPool<byte>.Shared.Return(bufferArray); }
    }

    // Reads until the buffer is full or the end of the stream is reached,
    // because a single Read call is allowed to return fewer bytes than requested.
    private static int ReadFully(Stream stream, Span<byte> buffer)
    {
        var total = 0;
        int read;
        while (total < buffer.Length && (read = stream.Read(buffer.Slice(total))) > 0)
            total += read;
        return total;
    }
This code requires C# 7.2 or later to compile. You may also have to reference the System.Buffers and System.Memory NuGet packages. If you target a platform that lacks the Stream.Read(Span<byte>) overload (for example, .NET Framework or a .NET Core version lower than 2.1), you also need to include this "polyfill":
    static class Compatibility
    {
        public static int Read(this Stream stream, Span<byte> buffer)
        {
            // copied over from corefx sources (https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/IO/Stream.cs)
            byte[] sharedBuffer = ArrayPool<byte>.Shared.Rent(buffer.Length);
            try
            {
                int numRead = stream.Read(sharedBuffer, 0, buffer.Length);
                if ((uint)numRead > buffer.Length)
                    throw new IOException("Stream was too long.");

                new Span<byte>(sharedBuffer, 0, numRead).CopyTo(buffer);
                return numRead;
            }
            finally { ArrayPool<byte>.Shared.Return(sharedBuffer); }
        }
    }
Usage:
    using (var fs = new FileStream(@"path-to-file", FileMode.Open, FileAccess.Read, FileShare.Read))
        Console.WriteLine(CountOccurences(fs, "string to search"));
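When you don't specify the encoding argument, the encoding is auto-detected by examining the BOM of the file. If no BOM is present, ASCII encoding is assumed as a fallback. Note that Encoding.ASCII replaces non-ASCII characters of the search string with '?', so pass the encoding explicitly (for example, Encoding.UTF8) if the file has no BOM and the search string contains such characters. The comparison is done byte by byte, so it is ordinal and case-sensitive, and overlapping occurrences are counted separately. With UTF-16 or UTF-32, a byte-level match could in rare cases start in the middle of a character.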