As always, we can often gain more speed by thinking about good algorithms and creating a good design.
First comment: I tested your approach with a 100MB source file, and it took at least 30 minutes on my machine in Release mode with all optimizations on.
And, as you mentioned yourself, it replaces all spaces, not only those at the beginning of a line. So we need to come up with a better solution.
First, we think about how we can identify spaces at the beginning of a line. For this we need a boolean flag that indicates that we are at the beginning of a line. We will call it beginOfLine and set it to true initially, because a file always starts at the beginning of a line.
Next, we check whether the current character is a space ' ' or a tab '\t' character. In contrast to other solutions, we will check for both.
If this is the case, whether we keep that space or tab in the output depends on whether we are at the beginning of a line. At the beginning of a line we drop it, otherwise we keep it. So the result is the inverse of beginOfLine.
If the character is not a space or tab, then we check for a newline. If we found one, we set the beginOfLine flag to true, otherwise to false. In either case, we want to keep the character.
All this can be put into a simple stateful lambda:
auto check = [beginOfLine = true](const char c) mutable -> bool {
    if ((c == ' ') || (c == '\t'))
        return !beginOfLine;
    beginOfLine = (c == '\n');
    return true;
};
or, more compact:
auto check = [beginOfLine = true](const char c) mutable -> bool {
if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };
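To see how the flag evolves, here is a minimal sketch that runs the lambda over a short text and records the decision for every character. The helper name keepMask is hypothetical and only serves as an illustration; it is not part of the original solution.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper for illustration: returns one character per input
// character, '1' if the lambda would keep it and '0' if it would drop it.
std::string keepMask(std::string_view text) {
    // Same stateful lambda as above: beginOfLine lives inside the closure
    auto check = [beginOfLine = true](const char c) mutable -> bool {
        if (c == ' ' || c == '\t') return !beginOfLine;
        beginOfLine = (c == '\n');
        return true;
    };

    std::string mask;
    mask.reserve(text.size());
    for (const char c : text)
        mask += check(c) ? '1' : '0';
    return mask;
}
```

For the input "  a b\n\tc" this yields "00111101": the two leading spaces and the tab after the newline are dropped, while the space between a and b is kept.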
Next, we will not erase the spaces from the original string, because that shifts huge amounts of memory around and takes brutally long. Instead, we copy the data (characters) to a new string, but only the needed ones.
And for that, we can use std::copy_if from the standard library:
std::copy_if(data.begin(), data.end(), data2.begin(), check);
This will do the work. For 100MB of data, it takes 160ms. Compared to 30 minutes, this is a tremendous saving.
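The statement above assumes that data2 already has enough room for the copied characters. A minimal sketch of the two-string approach could look like this, assuming the complete input is already in a std::string. The function name stripLeadingWhitespace is hypothetical. The destination is sized like the source and then trimmed with the iterator that std::copy_if returns.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper for illustration: copies 'data' into a second string
// and leaves out spaces and tabs at the beginning of each line.
std::string stripLeadingWhitespace(const std::string& data) {
    auto check = [beginOfLine = true](const char c) mutable -> bool {
        if (c == ' ' || c == '\t') return !beginOfLine;
        beginOfLine = (c == '\n');
        return true;
    };

    // The destination must be large enough before we write through begin()
    std::string data2(data.size(), '\0');

    // copy_if returns the position one past the last character written
    const auto end = std::copy_if(data.begin(), data.end(), data2.begin(), check);

    // Cut off the unused tail
    data2.erase(end, data2.end());
    return data2;
}
```

Called with "  int a;\n\t\tint b;\n", this returns "int a;\nint b;\n". No characters are shifted inside the source string, which is why this is so much faster than repeated erasing.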
Please see the example code (which of course needs to be adapted to your needs):
#include <iostream>
#include <fstream>
#include <filesystem>
#include <iterator>
#include <algorithm>
#include <string>
namespace fs = std::filesystem;
constexpr size_t SizeOfIOStreamBuffer = 1'000'000;
static char ioBuffer[SizeOfIOStreamBuffer];
int main() {
// Path to text file
const fs::path file{ "r:\\test.txt" };
// Open the file and check whether it could be opened
if (std::ifstream fileStream(file); fileStream) {
// Lambda that identifies whether we have a space or tab at the beginning of a line
auto check = [beginOfLine = true](const char c) mutable -> bool {
if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };
// Huge string with all file data
std::string data{};
// Size the string up front to speed things up and to avoid unnecessary allocations
data.resize(fs::file_size(file));
// Use buffered IO with a huge IO buffer
fileStream.rdbuf()->pubsetbuf(ioBuffer, SizeOfIOStreamBuffer);
// Read the file, eliminate spaces and tabs at the beginning of each line, and store the result in data
std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
}
return 0;
}
As you can see, it all boils down to one statement in the code. And this runs (on my machine) in 160ms for a 100MB file.
What can be optimized further? Of course, we see that we have two 100MB std::strings in our software. What a waste. The final optimization would be to put the two statements for reading the file and for removing spaces and tabs at the beginning of a line into one statement.
We will then have the data in memory only once, and we eliminate the nonsense of reading data from the file that we do not need. And the beauty of it is that, by using modern C++ language elements, only minor modifications are necessary. Just exchange the source iterators:
std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
Yes, I know that the string ends up larger than the data it holds, but it can easily be set to the actual size, for example by using data.reserve(...) and std::back_inserter.