3

I am working with files that range between 150MB and 250MB, and I need to append a form feed character (\f) to each match found in a match collection. Currently, my regular expression for each match is this:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);

and I'd like to modify each match in the file (and then overwrite the file) to become something that could be later found with a shorter regular expression:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)\f\f", RegexOptions.Singleline);

Put another way, I want to simply append a form feed character (\f) to each match that is found in my file and save it.

I see a ton of examples on Stack Overflow for replacing text, but not so much for larger files. Typical examples of what to do would include:

  • Using StreamReader to store the entire file in a string, then doing a find and replace in that string.
  • Using MatchCollection in combination with File.ReadAllText()
  • Read the file line by line and look for matches there.

The problem with the first two is that they just eat up a ton of memory, and I worry about the program being able to handle all of that. The problem with the 3rd option is that my regular expression spans many rows, and thus will not be found on a single line. I see other posts out there as well, but they cover replacing specific strings of text rather than working with regular expressions.

What would be a good approach for me to append a form feed character to each match found in a file, and then save that file?

Edit:

Per some suggestions, I tried playing around with StreamReader.ReadLine(). Specifically, I would read a line, check whether the accumulated text matched my expression, and then write to a file based on that result. If it matched the expression, I would write the match to the file. If it didn't match, I would just keep appending lines to a string until it did match. Like this:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);

//For storing/comparing our match.
string line, buildingmatch, match, whatremains;
buildingmatch = "";
match = "";
whatremains = "";

//For keep track of trailing bits after our match.
int matchlength = 0;

using (StreamWriter sw = new StreamWriter(destFile))
using (StreamReader sr = new StreamReader(srcFile))
{
    //While we are still reading lines in the file...
    while ((line = sr.ReadLine()) != null)
    {
        //Keep adding lines to buildingmatch until we can match the regular expression.
        buildingmatch = buildingmatch + line + "\r\n";
        if (myreg.IsMatch(buildingmatch))
        {
            match = myreg.Match(buildingmatch).Value;
            matchlength = match.Length;
            
            //Make sure we are not at the end of the file.
            if (matchlength < buildingmatch.Length)
            {
                whatremains = buildingmatch.Substring(matchlength, buildingmatch.Length - matchlength);
            }
            
            sw.Write(match + "\f\f");
            buildingmatch = whatremains;
            whatremains = "";
        }
        }
    }
}

The problem is that this took about 55 minutes to run on a roughly 150MB file. There HAS to be a better way to do this...

nightmare637
  • Some things to try: Ask if just matching the tail is enough to know the whole is matched. Or you could do a complicated buffering stack that windows your data. Or you could read the whole string into memory at once. Nowadays, drives read up to 4 GB/sec and RAM is cheap at 32 GB. – sln Oct 29 '21 at 16:49
  • Unfortunately, the tail is not enough to know the whole is matched; there are multiple form feeds in each match. – nightmare637 Oct 29 '21 at 16:58
  • It sounds like you could consider another approach. For example, can you read the string in chunks, from one `ABC:` to another, and run the match on each chunk? – tym32167 Oct 29 '21 at 17:02
  • You could try to change the EOL character in your readline to match the start and stop delimiters.. – sln Oct 29 '21 at 17:09
  • @sln Thank you for taking the time to comment! I tried playing around with readline (I modified my original question with what I did), but it ended up taking almost an hour to process the file. Also, I didn't see any way to modify the start/stop delimiters in my readline that were valid for C#. Could there be a more efficient way to do what I updated in my original question? – nightmare637 Oct 29 '21 at 23:57
  • @tym32167 Thanks for the reply! Unfortunately, the way the file is, matching from one ABC: to another would not be possible. In fact, ABC: itself appears multiple times in the same match; there is unfortunately a lot of repeated data, which is why I need to use the regex that I have. – nightmare637 Oct 30 '21 at 00:00
  • Well, why it is taking so long is that this `buildingmatch += line;` constantly adds a line of text to your buffer. After each _add_ you're running the same regex, which starts searching from the beginning of the buffer every time. It doesn't match because it can't find the end, because the end hasn't been added yet. It's like !N (N-factorial). Imagine searching for a `Q` in a buffer that expands 1 character at a time, when each search starts from the beginning and the `Q` is only added on the 10,000th pass. – sln Oct 30 '21 at 18:43
  • If you're trying to do a fifo type of buffer stack, it would be wise to keep some metrics of USAGE max buffer size, max lines, avg line size. This will tell you the efficacy if this is a good way to do it. – sln Oct 30 '21 at 18:50
  • Finally, if you need to stay with getting the whole record in the buffer, the better way is to identify the beginning within a line, something you know is constant. So run a partial of your main regex, `ABC: DEF11-1111`, on each line. If it matches, start buffering anew. Then look for the end, `\f`, in each line, adding to the buffer if it's not found. If it is found, run the main regex once on the buffer. If it matches, take the record off, clear the buffer, and start looking for a new record. If it doesn't, keep adding lines, but only check each line for the ending. Etc... – sln Oct 30 '21 at 19:10
  • Note that when I say clear the buffer, you would probably append it to your output file first. Another time saver is to also search each line for another _NEW_ record start. This lets you know that you can append the current buffer to the output file, then clear the buffer. This trick will save about 40% in lost time re-searching text that was part of the last record rather than the one you are seeking. – sln Oct 30 '21 at 19:24
  • @sln, Thank you for your insightful comments! Your first response really clarified why it's taking so long, and I understand that now. It would probably make more sense to read a few MBs worth of lines into the buffer (as opposed to one line at a time), parse out all the matches into a match collection and append them to the new file, and then store the remaining lines in the buffer and start again. – nightmare637 Oct 31 '21 at 01:15
  • Your 3rd comment really gave me a great idea! I'll try it on Monday, and if it works, I'll post the solution here. Thank you so much!!! – nightmare637 Oct 31 '21 at 01:19
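A minimal sketch of the line-based buffering approach from the comments above (the helper name `AppendFormFeeds` and the file paths are placeholders, not from the original code): do a cheap per-line check for the constant record start, buffer lines from there, and only run the expensive full regex when a `\f` shows up on a line. Because each record can contain inner form feeds, a failed match just means we keep buffering.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class RecordScanner
{
    // Buffers lines from the start marker onward and, whenever an end
    // marker (\f) appears, tries the full regex once on the buffer.
    // Matches are written out with two extra form feeds appended.
    public static void AppendFormFeeds(TextReader sr, TextWriter sw, Regex myreg)
    {
        StringBuilder buffer = new StringBuilder();
        bool buffering = false;
        string line;

        while ((line = sr.ReadLine()) != null)
        {
            // Cheap per-line check for the constant record start.
            if (!buffering && line.Contains("ABC: DEF11-1111"))
                buffering = true;

            if (!buffering)
                continue;

            buffer.Append(line).Append('\n');

            // Only run the expensive regex when an end marker appears;
            // an inner \f simply fails the match and we keep buffering.
            if (line.Contains("\f"))
            {
                Match m = myreg.Match(buffer.ToString());
                if (m.Success)
                {
                    sw.Write(m.Value + "\f\f");

                    // Keep anything after the match only if it could
                    // start the next record; otherwise drop it.
                    string rest = buffer.ToString().Substring(m.Index + m.Length);
                    buffer.Clear();
                    if (rest.Contains("ABC: DEF11-1111"))
                        buffer.Append(rest);
                    else
                        buffering = false;
                }
            }
        }
    }
}
```

Usage would look like `using (var sr = new StreamReader(srcFile)) using (var sw = new StreamWriter(destFile)) RecordScanner.AppendFormFeeds(sr, sw, myreg);`. This avoids re-running the regex over the whole buffer on every single line, which is what made the original ReadLine() version so slow.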

4 Answers

2

If you can load the whole string data into a single string variable, there is no need to first match and then append text to matches in a loop. You can use a single Regex.Replace operation:

string text = File.ReadAllText(srcFile);
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
     sw.Write(myregex.Replace(text, "$&\f\f"));
}

Details:

  • string text = File.ReadAllText(srcFile); - reads the srcFile file into the text variable (naming it match would be confusing)
  • myregex.Replace(text, "$&\f\f") - replaces all occurrences of myregex matches with themselves ($& is a backreference to the whole match value) while appending two \f chars right after each match.
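As a quick illustration of the `$&` substitution on a small string (the sample data here is made up, not from the question's files):

```csharp
using System;
using System.Text.RegularExpressions;

class ReplaceDemo
{
    static void Main()
    {
        Regex myregex = new Regex("ABC: DEF11-1111(.*?)\f", RegexOptions.Singleline);
        string text = "ABC: DEF11-1111 some data\f trailing";

        // $& re-inserts the whole match, so each match simply gains two \f chars.
        string result = myregex.Replace(text, "$&\f\f");
        Console.WriteLine(result == "ABC: DEF11-1111 some data\f\f\f trailing"); // True
    }
}
```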
Wiktor Stribiżew
1

I was able to find a solution that works in a reasonable time; it can process my entire 150MB file in under 5 minutes.

First, as mentioned in the comments, it's a waste to compare the string to the Regex after every iteration. Rather, I started with this:

string match = File.ReadAllText(srcFile);
MatchCollection mymatches = myregex.Matches(match);

Strings can hold up to 2GB of data, so while not ideal, I figured roughly 150MB worth wouldn't hurt to be stored in a string. Then, as opposed to checking a match every x amount of lines read in from the file, I can check the file for matches all at once!

Next, I used this:

StringBuilder matchsb = new StringBuilder(134217728);
foreach (Match m in mymatches)
{
     matchsb.Append(m.Value + "\f\f");
}

Since I already know (roughly) the size of my file, I can go ahead and initialize my StringBuilder accordingly. Not to mention, it's a lot more efficient to use a StringBuilder if you are doing multiple operations on a string (which I was). From there, it's just a matter of appending the form feeds to each of my matches.

Finally, the part that cost the most in performance:

using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
     sw.Write(matchsb.ToString());
}

The way that you initialize StreamWriter is critical. Normally, you just declare it as:

StreamWriter sw = new StreamWriter(destfile);

This is fine for most use cases, but the problem becomes apparent when you are dealing with larger files. When declared like this, you are writing to the file with a default buffer of 4KB. For a smaller file, this is fine. But for 150MB files? This will end up taking a long time. So I corrected the issue by increasing the buffer to approximately 5MB.

I found this resource really helped me to understand how to write to files more efficiently: https://www.jeremyshanks.com/fastest-way-to-write-text-files-to-disk-in-c/

Hopefully this will help the next person along as well.

nightmare637
  • I would be interested to know if using the `Replace` method on your regex might further improve performance. If it didn't, it would at least make the code a lot simpler, as you wouldn't need the `StringBuilder` any more... – Simon MᶜKenzie Nov 02 '21 at 20:13
  • That might be a good idea; I'll try it and let you know the outcome! – nightmare637 Nov 02 '21 at 20:21
  • @SimonMᶜKenzie, the Replace method did not significantly improve the performance, but as you noted, it did make the code a lot simpler. Thanks again for your suggestion! – nightmare637 Nov 02 '21 at 21:15
  • .NET 6 can probably improve performance even more for you, unless you're already using it. Read this: https://devblogs.microsoft.com/dotnet/file-io-improvements-in-dotnet-6/ – Bent Tranberg Nov 02 '21 at 21:34
  • @nightmare637 if you're still doing this kind of thing you might want to look at [Gigantor](https://github.com/imagibee/Gigantor). It supports regex search and replace on gigantic files. The 32 GB test completes in 38 seconds on my laptop with 13,952 match/replaces performed. So for your 250 MB data that should take less than a second. – dynamicbutter Apr 06 '23 at 00:08
0

When working with large text files in C# and needing to perform search-and-replace operations, there are a few approaches that can be considered to optimize performance.

One approach is to use memory-mapped files. Memory-mapped files allow you to access large files as if they were in-memory arrays, which can be more efficient than standard file I/O. To use them, you can use the MemoryMappedFile class in C#.

If memory-mapped files are a viable option, they can provide faster access to the file's contents than traditional reading and writing methods.
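A minimal sketch of opening a file through MemoryMappedFile and reading a chunk via a view stream (the path and window size are placeholders). Note one caveat: System.Text.RegularExpressions still operates on strings (or, in .NET 7+, ReadOnlySpan&lt;char&gt;), so you would still need to decode windows of the mapped file into memory and run the regex on each chunk, carrying over any partial record at the chunk boundary.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MmfExample
{
    static void Main()
    {
        string srcFile = "input.txt"; // placeholder path

        using (var mmf = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            // Read one window of the file; a real implementation would
            // slide this window and run the regex over each decoded chunk.
            char[] window = new char[1 << 20]; // 1M characters
            int read = reader.Read(window, 0, window.Length);
            string chunk = new string(window, 0, read);
            Console.WriteLine($"Read {read} characters");
        }
    }
}
```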

0

And even if your use case is for files that will not fit into RAM, Gigantor makes it quick and easy.

// Create the progress event required by Gigantor
System.Threading.AutoResetEvent progress = new(false);

// Create a regular expression
System.Text.RegularExpressions.Regex regex = new(
    "ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f",
    RegexOptions.Compiled);

// Create the searcher
Imagibee.Gigantor.RegexSearcher searcher = new(srcPath, regex, progress);

// Do the search
Imagibee.Gigantor.Background.StartAndWait(searcher, progress, (_) => { });

// Add extra form feed to each match
using System.IO.FileStream output = File.Create(destPath);
searcher.Replace(output, (match) => { return $"{match.Value}\f"; } );
dynamicbutter