
I'm studying ways to speed up my algorithms for programming contests, starting with faster input and output processing.

I'm currently using the thread-unsafe putchar_unlocked function to print output in some tests. I thought that this function, if used well, would be faster than cout and printf for some data types due to its thread-unsafe nature.

I implemented a function to print strings this way (IMHO very simple):

void write_str(char s[], int n){
    int i;
    for(i=0;i<n;i++)
        putchar_unlocked(s[i]);
}

I tested it with a fixed string of exactly n characters, but it is the slowest of the three, as we can see in this chart of time in seconds versus number of output writes:

[Graph: time for writing N character arrays]
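For reference, the timing harness was roughly like this (a sketch rather than my exact code; the write count, the 30-character string, and the use of clock() are stand-ins):

#include <stdio.h>
#include <time.h>

void write_str(char s[], int n);                 /* the function shown above */

int main(void) {
    char s[] = "012345678901234567890123456789"; /* 30 characters */
    long writes = 1000000L;                      /* placeholder for the number of writes */
    clock_t start = clock();
    for (long i = 0; i < writes; i++)
        write_str(s, 30);
    clock_t end = clock();
    fprintf(stderr, "%f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}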

Why is it the slowest?

648trindade
    y axis = time (seconds) ; x axis = number of writes – abligh Sep 19 '15 at 20:08
  • How do you obtain `n`? Hard-coded constant? Or using `strlen()`? Also, why don't you use `fputs()` or `fwrite()`? – user12205 Sep 19 '15 at 20:13
    @DietmarKühl Actually the graph's title is quite clear (although it's in Portuguese). It reads "Time for writing N character arrays", so I'd say `N` is the number of strings. – Filipe Gonçalves Sep 19 '15 at 20:14
  • Sorry for the graph language. These results were obtained by printing the same string of size _x_ (30, actually). I took 100 execution times for each case, then calculated the average and plotted it in the graph – 648trindade Sep 19 '15 at 20:17
    Another silly question: I assume you compiled with optimization? – Dietmar Kühl Sep 19 '15 at 20:20
  • Nope. No optimization flags. – 648trindade Sep 19 '15 at 20:21
  • Profiling without optimisation is completely silly. You're asking the compiler to stop before it's finished its job. You are literally asking it to produce worse code and to do things more verbosely than it needs to (and it may do so varyingly for different constructs). As a result, your results are pretty meaningless! – Lightness Races in Orbit Sep 19 '15 at 20:23
  • And benchmarking IO is highly hardware-dependent anyway. – Andrew Henle Sep 19 '15 at 20:35
  • Also note that the caller of `putchar_unlocked()` needs to have a lock on `stdout`. You probably should have calls to `flockfile(stdout)`/`funlockfile(stdout)` around your `for` loop unless whatever is calling `write_str()` is taking that responsibility. – Michael Burr Sep 19 '15 at 20:44
  • @MichaelBurr Is this really needed? My code is sequential, single-threaded. – 648trindade Sep 19 '15 at 20:55
  • @AndrewHenle This benchmarking is for a specific purpose (programming contests) where the environments are all the same. – 648trindade Sep 19 '15 at 20:55
  • @LightnessRacesinOrbit I think that enabling compiler optimization reduces redundancy. My test code is intentionally redundant: it stresses the machine by doing exactly the same thing over and over. The top-level function implementations should make the difference in this case. – 648trindade Sep 19 '15 at 20:55
    Isn't this as simple as "number of context switches"? – abligh Sep 20 '15 at 00:51
  • Programming contest? It'll be hard to beat replacing your `write_str( char s[], int n )` with a simple `write( 1, s, ( size_t ) n );` – Andrew Henle Sep 20 '15 at 14:40

3 Answers


Assuming the time measurements for up to about 1,000,000 characters are below the measurement threshold, and that the writes to std::cout and stdout are made using a bulk form (e.g. std::cout.write(str, size)), I'd guess that putchar_unlocked() spends most of its time updating parts of the stream's data structures in addition to storing the character. The bulk writes would instead copy the data into the buffer in one go (e.g., using memcpy()) and update the data structures internally just once.

That is, the code would look something like this (this is pidgin-code, i.e., just roughly showing what's going on; the real code would be at least slightly more complicated):

int putchar_unlocked(int c) {
    *stdout->put_pointer++ = c;          // store the character and advance the put pointer
    if (stdout->put_pointer != stdout->buffer_end) {
        return c;                        // buffer not full: per-character bookkeeping done
    }
    // buffer full: flush it with one system call
    int rc = write(stdout->fd, stdout->buffer_begin, stdout->put_pointer - stdout->buffer_begin);
    // ignore partial writes
    stdout->put_pointer = stdout->buffer_begin;
    return rc == stdout->buffer_size? c: EOF;
}

The bulk version of the code instead does something along these lines (using C++ notation, as that is easier for a C++ developer; again, this is pidgin-code):

int std::streambuf::write(char const* s, std::streamsize n) {
    std::lock_guard<std::mutex> guard(this->mutex);
    std::streamsize b = std::min(n, this->epptr() - this->pptr());
    memcpy(this->pptr(), s, b);          // bulk-copy as much as fits into the buffer
    this->pbump(b);
    bool success = true;
    if (this->pptr() == this->epptr()) {
        // buffer full: flush it with one system call
        success = this->epptr() - this->pbase()
            == write(this->fd, this->pbase(), this->epptr() - this->pbase());
        // also ignoring partial writes
        this->setp(this->pbase(), this->epptr());
        memcpy(this->pptr(), s + b, n - b);   // copy the remainder into the emptied buffer
        this->pbump(n - b);
    }
    return success? n: -1;
}

The second version may look a bit more complicated, but it is only executed once for the whole 30 characters. A lot of the checking is moved out of the interesting bit. Even if some locking is done, it is locking an uncontended mutex and will not inhibit the processing much.

Especially without optimization, the loop using putchar_unlocked() will not be optimized much. In particular, the code won't get vectorized, which immediately costs a factor of at least about 3, and probably closer to 16, on the actual loop. By comparison, the cost of the lock quickly diminishes.
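For comparison, a bulk variant of the question's write_str() would hand the whole array to the stream in one call, letting the library do the memcpy() and the bookkeeping once (a sketch; write_str_bulk() is not from the original code):

#include <cstdio>

// Bulk version: one buffered library call copies the whole array into
// stdout's buffer and updates the bookkeeping once, not once per character.
void write_str_bulk(char const* s, int n) {
    std::fwrite(s, 1, static_cast<std::size_t>(n), stdout);
}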

BTW, just to create a reasonably level playing field: aside from optimizing, you should also call std::ios_base::sync_with_stdio(false) when using the C++ standard stream objects.
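For instance (a minimal sketch):

#include <iostream>

int main() {
    // Decouple the C++ streams from C's stdio so std::cout can use its own
    // buffering instead of forwarding each operation to stdout.
    std::ios_base::sync_with_stdio(false);
    std::cout << "hello\n";
}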

Dietmar Kühl

Choosing the fastest way to output strings depends on the platform, operating system, compiler settings, and runtime library in use, but there are some generalizations which may help you understand what to select.

First, consider that the operating system may have a means of writing whole strings, as opposed to one character at a time. If so, looping over a system call that outputs a single character naturally incurs the overhead of a call into the system for every character, as opposed to the overhead of one system call processing an entire character array.

That's basically what you're encountering, the overhead of a system call.

The performance enhancement of putchar_unlocked, compared to putchar, may be considerable, but only between those two functions. Further, most runtime libraries do not have putchar_unlocked (I find it in older Mac OS X documentation, but not Linux or Windows).

That said, locked or unlocked, there is still per-character overhead that can be eliminated by a single system call processing the entire character array, and the same applies to output to files and other devices, not just the console.
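To illustrate, handing the entire array to the operating system in one system call might look like this (a sketch using POSIX write(); error handling and stdio buffering are deliberately ignored, and write_str_syscall() is a hypothetical name):

#include <unistd.h>

// One system call for the whole array instead of one call per character.
// Note that this bypasses the stdio buffer entirely.
void write_str_syscall(char const* s, int n) {
    write(STDOUT_FILENO, s, (size_t)n);
}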

JVene
  • Hi @JVene, does this overhead occur for input too? Because *getchar_unlocked* beats *scanf* and *cin* in all cases of input processing. I'm using Debian testing and gcc 5.2.1 (g++ too). – 648trindade Sep 19 '15 at 20:39
    Conceptually, scanf has a LOT of work to do which getchar (any flavor) does not. Look at the source of scanf to see why. Further, context matters, and just how you're evaluating getchar compared to cin. Are you taking input from a pipe at the command line? If so, what string processing is used? You may find there's more in the string processing duties of cin than you define with getchar. That is to say, the context of calling an OUTPUT function is entirely different than calling an INPUT function, especially one like scanf. – JVene Sep 19 '15 at 20:45

My personal guess is that printf() does its output in chunks, and only has to cross the app/kernel boundary once per chunk.

putchar_unlocked() crosses it for every byte written.

Russ Schultz