
I have very large files that need to be read into memory. These files must be in a human-readable format, and so they are polluted with tab indenting until normal characters appear. For example, the following text is preceded by three spaces (which is equivalent to one tab indent):

    /// There is a tab before this text.
    Sample Text   There is a tab in between the word "Text" and "There" in this line.
    9919
    2250
    {
        ""
        5
        255
    }

Currently I simply run the following code to replace the tabs (after the file has been loaded into memory):

void FileParser::ReplaceAll(
   std::string& the_string,
   const std::string& from,
   const std::string& to) const
{
   size_t start_pos = 0;
   while ((start_pos = the_string.find(from, start_pos)) != std::string::npos)
   {
      the_string.replace(start_pos, from.length(), to);
      start_pos += to.length(); // In case 'to' contains 'from', like replacing 'x' with 'yx'
   }
}

There are two issues with this code...

  • It takes 18 seconds just to complete the replacement on this text.
  • It replaces ALL tabs, but I only want the tabs up to the first non-tab character removed. Tabs that appear after the first non-tab character on a line should be left alone.

Can anyone offer up a solution that would speed up the process and only remove the initial tab indents of each line?

Rick

2 Answers


I'd do it this way:

std::string without_leading_chars(const std::string& in, char remove)
{
    std::string out;
    out.reserve(in.size());
    bool at_line_start = true;
    for (char ch : in)
    {
        if (ch == '\n')
            at_line_start = true;
        else if (at_line_start)
        {
            if (ch == remove)
                continue; // skip this char, do not copy
            else
                at_line_start = false;
        }
        out.push_back(ch);
    }
    return out;
}

That's one memory allocation and a single pass, so pretty close to optimal.
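A quick way to sanity-check the behavior: leading tabs are removed, but tabs appearing after the first normal character on a line survive. The function is repeated below so the snippet compiles on its own; the test strings are my own.

```cpp
#include <string>

// Same function as above, repeated so this snippet is self-contained.
std::string without_leading_chars(const std::string& in, char remove)
{
    std::string out;
    out.reserve(in.size());
    bool at_line_start = true;
    for (char ch : in)
    {
        if (ch == '\n')
            at_line_start = true;      // next char starts a new line
        else if (at_line_start)
        {
            if (ch == remove)
                continue;              // skip leading char, do not copy
            else
                at_line_start = false; // first normal char reached
        }
        out.push_back(ch);
    }
    return out;
}
```

For example, `without_leading_chars("\t\tSample\tText\n", '\t')` yields `"Sample\tText\n"`: the two leading tabs are gone, the embedded one is kept.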

John Zwinck

As always, we can often gain more speed by thinking about good algorithms and creating a good design.

First, a comment: I tested your approach with a 100 MB source file, and it took at least 30 minutes on my machine in Release mode with all optimizations on.

And, as you mentioned yourself, it replaces all tabs, not only those at the beginning of each line. So we need to come up with a better solution.

First, we think about how to identify spaces at the beginning of a line. For this we need a boolean flag that indicates that we are at the beginning of a line. We will call it beginOfLine and set it to true initially, because the file always starts with a line.

Next, we check whether the current character is a space ' ' or a tab '\t'. In contrast to other solutions, we will check for both.

If it is, whether we keep that space or tab in the output depends on whether we are at the beginning of the line. So the result of the predicate is simply the inverse of beginOfLine.

If the character is not a space or tab, we check for a newline. If we find one, we set the beginOfLine flag to true, otherwise to false. In either case, we keep the character.

All of this can be put into a simple stateful lambda:

auto check = [beginOfLine = true](const char c) mutable -> bool {
    if ((c == ' ') || (c == '\t'))
        return !beginOfLine;
    beginOfLine = (c == '\n');
    return true;
};

or, more compact:

auto check = [beginOfLine = true](const char c) mutable -> bool {
    if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };

Next, we will not erase the spaces from the original string, because that is a huge memory-shifting activity which takes brutally long. Instead, we copy the characters to a new string, but only the needed ones.

And for that, we can use std::copy_if from the standard library.

std::copy_if(data.begin(), data.end(), data2.begin(), check);

(Note that data2 must be resized to at least data.size() beforehand, because std::copy_if writes through the iterator and does not allocate.)

This will do the work. And for 100 MB of data, it takes 160 ms. Compared to 30 minutes, this is a tremendous saving.
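Here is a minimal sketch of the lambda in action on a small in-memory string. The helper name strip_leading_ws and the use of std::back_inserter (which grows the target, so no pre-sizing is needed in this small demo) are my additions:

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Copy `data` while dropping spaces/tabs at the beginning of each line;
// everything after the first non-blank character on a line is kept.
// (The helper name strip_leading_ws is mine, for illustration.)
std::string strip_leading_ws(const std::string& data)
{
    auto check = [beginOfLine = true](const char c) mutable -> bool {
        if (c == ' ' || c == '\t') return !beginOfLine;
        beginOfLine = (c == '\n');
        return true;
    };
    std::string out;
    out.reserve(data.size());
    // back_inserter grows the target string, so no resize is needed here
    std::copy_if(data.begin(), data.end(), std::back_inserter(out), check);
    return out;
}
```

For example, `strip_leading_ws("\t\tSample Text\n")` yields `"Sample Text\n"`.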

Please see the example code (which of course needs to be adapted for your needs):

#include <iostream>
#include <fstream>
#include <filesystem>
#include <iterator>
#include <algorithm>
#include <string>

namespace fs = std::filesystem;
constexpr size_t SizeOfIOStreamBuffer = 1'000'000;
static char ioBuffer[SizeOfIOStreamBuffer];

int main() {

    // Path to text file
    const fs::path file{ "r:\\test.txt" };

    // Open the file and check, if it could be opened
    if (std::ifstream fileStream(file); fileStream) {

        // Lambda that identifies whether we have a space or tab at the beginning of a line
        auto check = [beginOfLine = true](const char c) mutable -> bool {
            if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };

        // Huge string with all file data
        std::string data{};

        // Reserve space to speed things up and to avoid unnecessary allocations
        data.resize(fs::file_size(file));

        // Used buffered IO with a huge iobuffer
        fileStream.rdbuf()->pubsetbuf(ioBuffer, SizeOfIOStreamBuffer);

        // Read the file, eliminate spaces and tabs at the beginning of each line, and store the result in data
        std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
    }
    return 0;
}

As you can see, everything boils down to one statement in the code. And this runs (on my machine) in 160 ms for a 100 MB file.

What can be optimized further? If we first read the whole file into one string and then copy the result into a second string, we have two 100 MB std::strings in our program. What a waste. The final optimization is to put the two statements, reading the file and removing the spaces and tabs at the beginning of each line, into one statement:

std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);

We then have the data in memory only once, and we eliminate the nonsense of reading data from the file that we do not need. And the beauty of it is that, by using modern C++ language elements, only minor modifications are necessary: just exchange the source iterators. This is exactly what the example above does.

Yes, I know that the string size is too big in the end, but it can easily be set to the actual value, for example by using data.reserve(...) together with std::back_inserter.
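A sketch of what that final variant could look like, with the reserve/back_inserter idea applied. The function name readWithoutIndent is mine, and error handling is kept minimal as in the original:

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Read a file and drop spaces/tabs at the beginning of each line,
// keeping only one copy of the data in memory.
std::string readWithoutIndent(const fs::path& file)
{
    std::string data;
    std::ifstream fileStream(file);
    if (fileStream) {
        // Reserve the upper bound; back_inserter then grows the
        // string to exactly the number of kept characters.
        data.reserve(fs::file_size(file));
        auto check = [beginOfLine = true](const char c) mutable -> bool {
            if (c == ' ' || c == '\t') return !beginOfLine;
            beginOfLine = (c == '\n');
            return true;
        };
        std::copy_if(std::istreambuf_iterator<char>(fileStream), {},
                     std::back_inserter(data), check);
    }
    return data;
}
```

With this version, data.size() reflects the actual number of characters kept, so no trimming step is needed afterwards.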

A M