Assuming the time measurements for up to about 1,000,000 million characters is below a measurement threshold and the writes to std::cout
and stdout
are made using a form using bulk-writes (e.g. std::cout.write(str, size)
), I'd guess that putchar_unlock()
spends most of its time actually updating some part of the data structures in addition to putting the character. The other bulk-writes would copy the data into a buffer in bulk (e.g., using memcpy()
) and update the data structures internally just once.
That is, the codes would look something like this (this is pidgeon-code, i.e., just roughly showing what's going on; the real code would be, at least, slightly more complicated):
int putchar_unlocked(int c) {
*stdout->put_pointer++ = c;
if (stdout->put_pointer != stdout->buffer_end) {
return c;
}
int rc = write(stdout->fd, stdout->buffer_begin, stdout->put_pointer - stdout->buffer_begin);
// ignore partial writes
stdout->put_pointer = stdout->buffer_begin;
return rc == stdout->buffer_size? c: EOF;
}
The bulk-version of the code are instead doing something along the lines of this (using C++ notation as it is easier being a C++ developer; again, this is pidgeon-code):
int std::streambuf::write(char const* s, std::streamsize n) {
std::lock_guard<std::mutex> guard(this->mutex);
std::streamsize b = std::min(n, this->epptr() - this->pptr());
memcpy(this->pptr(), s, b);
this->pbump(b);
bool success = true;
if (this->pptr() == this->epptr()) {
success = this->this->epptr() - this->pbase()
!= write(this->fd, this->pbase(), this->epptr() - this->pbase();
// also ignoring partial writes
this->setp(this->pbase(), this->epptr());
memcpy(this->pptr(), s + b, n - b);
this->pbump(n - b);
}
return success? n: -1;
}
The second code may look a bit more complicated but is only executed once for 30 characters. A lot of the checking is moved out of the interesting bit. Even if there is some locking done, it is is locking an uncontended mutex and will not inhibit the processing much.
Especially when not doing any profiling the loop using putchar_unlocked()
will not be optimized much. In particular, the code won't get vectorized which causes an immediate factor of at least about 3 but probably even closer to 16 on the acutal loop. The cost for the lock will quickly diminish.
BTW, just to create reasonably level playground: aside from optimizing you should also call std::sync_with_stdio(false)
when using C++ standard stream objects.