I am developing an application that reads and works with text files. These text files have the following structure:

** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6

I store the whole text file in memory by using `string text = File.ReadAllText(file);`. However, I want to remove all lines that are comments, i.e. all lines starting with "**".

This is achievable by the following method:

// this method also removes any white-spaces (this is intended)
string RemoveComments(string textWithComments)
{
    string textWithoutComments = null;

    string[] split = Regex.Split(textWithComments.Replace(" ", null), "\r\n|\r|\n").ToArray();
    foreach (string line in split)
        if (line.Length >= 2 && line[0] == '*' && line[1] == '*') continue;
        else textWithoutComments += line + "\r\n";

    return textWithoutComments;
}

However, this is incredibly slow for big files. I also think it should be possible to replace the whole method with a single line of code (possibly using Regex). How can I achieve this? (I have never used regex before.)

PS: I also want to avoid StreamReaders.

EDIT

An example file would look like this:

** Initial comment
*Command-0
** Some Comment: Header: Text
** Some text: text
*Command-1
**
** Some comment or text
**
*Command-2
*Command-3
      1,            2,            3
      2,            2,            4
      3,            2,            5
** END COMMENT
Carlos
  • Although it won't make the parsing itself faster, you should be using asynchronous IO. It is also not clear to me why you would use a `Regex` over `text.Split('\r', '\n')`, and your `ToArray` call is pointless and potentially costly. – Aluan Haddad Aug 24 '20 at 22:03
  • How big are the files? – Alexandru Clonțea Aug 24 '20 at 22:12
  • Why do you want to avoid StreamReader? If you want this to be fast, processing the file using a StreamReader is what you want. – mtreit Aug 24 '20 at 22:15
  • @AluanHaddad Working with files with 100,000+ lines, replacing `string[] split = Regex.Split(text.Replace(" ", null), "\r\n|\r|\n").ToArray();` with `string[] split = text.Replace(" ", null).Split('\r', '\n');` takes the execution time of that line from around 100ms to around 60ms. The problem is in the `foreach` loop (execution takes several minutes). – Carlos Aug 24 '20 at 22:17
  • What you are trying to accomplish is pretty much a perfect use case for StreamReader. With that said, I agree with Aluan's suggestion as regex is overkill for this operation and most likely a major factor for slow runtime. Also use StringBuilder instead of string. It is much faster than appending to string. – Austin G Aug 24 '20 at 22:17
  • The reason I want to avoid `StreamReader` is because I already have the remaining application logic prepared to handle a `string text`. – Carlos Aug 24 '20 at 22:20
  • @Carlos - Any reason why you wouldn't do this: `File.WriteAllLines(@"", File.ReadLines(@"").Where(x => !x.StartsWith("**")));` ?? (I left the file names as `@""` for brevity.) – Enigmativity Aug 24 '20 at 23:36
  • @Enigmativity Actually, based on your comment, something like `var text = string.Join("\r\n", File.ReadLines(file).Where(x => !x.StartsWith("**")));` is a very good solution and it seems even faster (tested) than the `StringBuilder` method. – Carlos Aug 24 '20 at 23:47
  • @Carlos yes, the use of `where` also improves the readability and deferred execution can reduce memory consumption. Furthermore, you can do `File.ReadLines(file).Select((line, index) => (line, index)).AsParallel().Where(x => !x.line.StartsWith("**")).OrderBy(x => x.index).Select(x => x.line)` which can apply the predicate in parallel. – Aluan Haddad Aug 25 '20 at 05:42
  • @AluanHaddad - With computational complexity that low, using `AsParallel` on an IO operation is going to make it slower. – Enigmativity Aug 25 '20 at 05:49
  • @Enigmativity Well, it's already been read into memory so it's not doing any IO. – Aluan Haddad Aug 25 '20 at 05:55
  • @AluanHaddad - Not with `File.ReadLines(file)` it hasn't. And the final sort will be `O(n log n)` whereas the `StartsWith` is `O(n)`. It's slower to use parallel. – Enigmativity Aug 25 '20 at 05:59
  • @Enigmativity Ah, I see your point. I was thinking of a different file API. – Aluan Haddad Aug 25 '20 at 06:01
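
For reference, the `File.ReadLines` + `string.Join` idea from the comments above could be written out roughly like this. It is only a sketch: the class name `CommentFilter` and both method names are made up for illustration, and it assumes that spaces should still be stripped, as in the question's method. Unlike that method, `string.Join` does not add a trailing "\r\n" after the last line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class; the names are illustrative, not from the question.
public static class CommentFilter
{
    // Filters any sequence of lines, so it works with File.ReadLines
    // as well as with an in-memory array.
    public static string RemoveComments(IEnumerable<string> lines)
    {
        return string.Join("\r\n",
            lines.Select(line => line.Replace(" ", string.Empty))                 // strip spaces first, as the question does
                 .Where(line => !line.StartsWith("**", StringComparison.Ordinal))); // then drop comment lines
    }

    // File.ReadLines enumerates the file lazily instead of loading it all at once.
    public static string RemoveCommentsFromFile(string path)
    {
        return RemoveComments(File.ReadLines(path));
    }
}
```

Calling `CommentFilter.RemoveCommentsFromFile(file)` would then take the place of `File.ReadAllText(file)` followed by `RemoveComments(text)`.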

3 Answers

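Concatenating strings with += allocates a new string and copies the existing contents on every iteration, so the cost of the loop grows quadratically with the size of the file.

StringBuilder appends into a growable buffer, so it reallocates far less often and decreases the runtime significantly.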

string RemoveComments(string textWithComments)
{
    StringBuilder textWithoutComments = new StringBuilder();

    string[] split = textWithComments.Replace(" ", null).Split('\r', '\n');
    foreach (string line in split)
        if (line.Length >= 2 && line[0] == '*' && line[1] == '*') continue;
        else textWithoutComments.Append(line + "\r\n");

    return textWithoutComments.ToString();
}

Edited to include Aluan's suggestion.

Austin G
  • Using `StringBuilder` the execution times are greatly improved, however using this method I get extra "\r\n". This is fixed by redoing the `else` line with `else if (!string.IsNullOrEmpty(line)) textWithoutComments.Append(line + "\r\n");` – Carlos Aug 24 '20 at 22:33

Why not just:

var text = @"** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6";

// Version 1: leaves a \n at the beginning of the string if the text starts with a comment.
var textWithoutCommentsV1 = Regex.Replace(text, @"(^|\n)\*\*.*(?=\n)", string.Empty);

// Version 2: deals with that problem using a longer regex that treats the first line differently from the other lines (it consumes the line break rather than leaving it in the text).
var textWithoutCommentsV2 = Regex.Replace(text, @"(^\*\*.*\r\n)|((\r\n)\*\*.*($|(?=\r\n)))", string.Empty);

Don't know about performance, I don't have test data at the ready...

PS: I also am inclined to believe that if you want top performance, some streaming might be ideal, you can always return a string from the method if that makes things easier for later processing. I think most people in this thread are suggesting StreamReader for the iteration/reading/interpreting part, regardless of the return type you decide to build.

Alexandru Clonțea
  • Can you explain the regex pattern? In my files this removes some lines with comments, but not all (even though they all start with `"**"`). – Carlos Aug 24 '20 at 22:38
  • Can you post an example? I did notice a bug I'm trying to figure out. – Alexandru Clonțea Aug 24 '20 at 22:40
  • I'm trying to figure out how to treat it as a single-line Regex because I was not able to find a multiline version that also removes the lines. The `(^|\n)` part matches either the beginning of the string or a line feed character, the `\*\*` matches "**", and the rest just tries to locate the next line feed character, but something is wrong in my logic. To make things worse, it appears the engine on Regexr.com behaves differently from the one in C#. @Carlos – Alexandru Clonțea Aug 24 '20 at 22:43
  • I've added another example. – Carlos Aug 24 '20 at 22:50
  • It seems that your method removes every other comment. – Carlos Aug 24 '20 at 22:52
  • Exactly, since the ending line feed was "consumed" rather than just "checked". So when the first match occurred for two lines in a row, the second one would be ignored. I've fixed that, but now there's another issue where if the first line is a comment, it is left as line feed rather than an empty line. @Carlos – Alexandru Clonțea Aug 24 '20 at 22:56
  • it seems to be working except that if there is a comment on the 1st line, the line does not get removed (although this is not really a problem). However, if there is a comment in the last line, this does not get removed/deleted. I've also updated the example (it should contain all possible cases for a comment). – Carlos Aug 24 '20 at 23:04
  • @Carlos I've found a solution for the second issue. – Alexandru Clonțea Aug 24 '20 at 23:04
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220382/discussion-between-alexandru-clonea-and-carlos). – Alexandru Clonțea Aug 24 '20 at 23:07
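
For what it's worth, the cases discussed in the comments above (consecutive comments, a comment on the first line, a comment on the last line) can also be handled with `RegexOptions.Multiline`, where `^` matches at the start of every line. This is a sketch of one possible pattern, not the pattern from this answer; the class name `RegexCommentFilter` is made up, and it assumes lines end in "\n" or "\r\n" (a lone "\r" is not treated as a line break here). It does not strip spaces.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is illustrative only.
public static class RegexCommentFilter
{
    // ^          start of a line (because of RegexOptions.Multiline)
    // \*\*       the literal comment marker "**"
    // [^\r\n]*   the rest of the line
    // (?:\r?\n|\z)  the line break itself, or the end of the text for a final comment
    private static readonly Regex CommentLine = new Regex(
        @"^\*\*[^\r\n]*(?:\r?\n|\z)",
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Each match consumes its own line break, so the next comment line
    // still starts right after a "\n" and is matched as well.
    public static string RemoveComments(string text)
    {
        return CommentLine.Replace(text, string.Empty);
    }
}
```

Because every comment is removed together with its own line break, no empty lines are left behind where the comments used to be.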

I know you said you don't want to use StreamReader, but the following code can process 400,000 lines in less than half a second on my computer. It's simple, straight-forward and fast.

static void RemoveCommentsAndWhitespace(string filePath)
{
    if (!File.Exists(filePath))
    {
        Console.WriteLine($"ERR: The file '{filePath}' does not exist.");
        return; // nothing to process; without this the StreamReader below would throw
    }

    string outfile = filePath + ".out";

    using StreamReader sr = new StreamReader(filePath);
    using StreamWriter sw = new StreamWriter(outfile);
    string line;

    while ((line = sr.ReadLine()) != null)
    {
        string tmp = line.Replace(" ", string.Empty);
        if (tmp.StartsWith("**", StringComparison.Ordinal)) // ordinal comparison avoids culture-sensitive matching
        {
            continue;
        }

        sw.WriteLine(tmp);
    }

    Console.WriteLine($"Wrote to {outfile}.");
}
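
If the rest of the application expects a `string` rather than an output file, the same loop can run over text that is already in memory. The following is a sketch under that assumption: it swaps the `StreamReader`/`StreamWriter` pair for a `StringReader` and a `StringBuilder`, and the class name `InMemoryCommentFilter` is made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class; the name is illustrative only.
public static class InMemoryCommentFilter
{
    public static string RemoveCommentsAndWhitespace(string text)
    {
        // Pre-size the builder; the result can never be longer than the input
        // plus one "\r\n" for a final line that had no line break.
        var result = new StringBuilder(text.Length + 2);
        using var reader = new StringReader(text);
        string line;

        // StringReader.ReadLine accepts "\r\n", "\n" and "\r" as line breaks.
        while ((line = reader.ReadLine()) != null)
        {
            string tmp = line.Replace(" ", string.Empty);
            if (tmp.StartsWith("**", StringComparison.Ordinal))
            {
                continue; // skip comment lines
            }

            result.Append(tmp).Append("\r\n");
        }

        return result.ToString();
    }
}
```

With this variant the existing `string text = File.ReadAllText(file);` call can stay as it is, and the result is passed on as a string.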
mtreit