
Both Go and C invoke system calls directly (technically, C calls a stub).

Technically, write is both a system call and a C function (at least on many systems). However, the C function is just a stub which invokes the system call. Go does not call this stub; it invokes the system call directly, which means that C is not involved here.

From Differences between C write call and Go syscall.Write

My benchmark shows that a pure C system call is 15.82% faster than a pure Go system call in the latest release (go1.11).

What did I miss? What could be the reason, and how can I optimize it?

Benchmarks:

Go:

package main_test

import (
    "syscall"
    "testing"
)

func writeAll(fd int, buf []byte) error {
    for len(buf) > 0 {
        n, err := syscall.Write(fd, buf)
        if err != nil {
            return err
        }
        buf = buf[n:]
    }
    return nil
}

func BenchmarkReadWriteGoCalls(b *testing.B) {
    fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
    if err != nil {
        b.Fatal(err)
    }
    message := "hello, world!"
    buffer := make([]byte, 13)
    for i := 0; i < b.N; i++ {
        writeAll(fds[0], []byte(message))
        syscall.Read(fds[1], buffer)
    }
}

C:

#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int write_all(int fd, const void *buffer, size_t length) {
    const char *p = buffer;  /* char * so pointer arithmetic is well-defined */
    while (length > 0) {
        ssize_t written = write(fd, p, length);
        if (written < 0)
            return -1;
        length -= (size_t)written;
        p += written;
    }
    return 0;
}

int read_call(int fd, void *buffer, size_t length) {
    return read(fd, buffer, length);
}

struct timespec timer_start(){
    struct timespec start_time;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time);
    return start_time;
}

long timer_end(struct timespec start_time){
    struct timespec end_time;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time);
    long diffInNanos = (end_time.tv_sec - start_time.tv_sec) * (long)1e9 + (end_time.tv_nsec - start_time.tv_nsec);
    return diffInNanos;
}

int main() {
    int i = 0;
    int N = 500000;
    int fds[2];
    char message[] = "hello, world!";  /* 14 bytes including the terminating NUL */
    char buffer[14] = {0};

    socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    struct timespec vartime = timer_start();
    for(i = 0; i < N; i++) {
        write_all(fds[0], message, sizeof(message));
        read_call(fds[1], buffer, 14);
    }
    long time_elapsed_nanos = timer_end(vartime);
    printf("BenchmarkReadWritePureCCalls\t%d\t%ld ns/op\n", N, time_elapsed_nanos/N);
}

340 runs; each C run performs 500,000 executions, and each Go run performs b.N executions (mostly 500,000, occasionally 1,000,000):


T-Test for 2 Independent Means: The t-value is -22.45426. The p-value is < .00001. The result is significant at p < .05.


T-Test Calculator for 2 Dependent Means: The value of t is 15.902782. The value of p is < 0.00001. The result is significant at p ≤ 0.05.



Update: I followed the proposal in the answer and wrote another benchmark. It shows that the proposed approach significantly reduces the performance of massive I/O calls; its performance is close to that of CGO calls.

Benchmark:

func BenchmarkReadWriteNetCalls(b *testing.B) {
    cs, _ := socketpair()
    message := "hello, world!"
    buffer := make([]byte, 13)
    for i := 0; i < b.N; i++ {
        cs[0].Write([]byte(message))
        cs[1].Read(buffer)
    }
}

func socketpair() (conns [2]net.Conn, err error) {
    fds, err := syscall.Socketpair(syscall.AF_LOCAL, syscall.SOCK_STREAM, 0)
    if err != nil {
        return
    }
    conns[0], err = fdToFileConn(fds[0])
    if err != nil {
        return
    }
    conns[1], err = fdToFileConn(fds[1])
    if err != nil {
        conns[0].Close()
        return
    }
    return
}

func fdToFileConn(fd int) (net.Conn, error) {
    f := os.NewFile(uintptr(fd), "")
    defer f.Close()
    return net.FileConn(f)
}


The above figure shows 100 runs; each C run performs 500,000 executions, and each Go run performs b.N executions (mostly 500,000, occasionally 1,000,000).

    What do you mean by a "raw systemcall"? Can you please elaborate? And please try to make your question self-contained by adding a [Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve) or two. – Some programmer dude Sep 12 '18 at 14:03
  • @JimB: It states that pure C calls are only 2244 ns/op though. You're looking at CGo (using native C interfaces from within Go) vs. Go, not Go vs. C. – ShadowRanger Sep 12 '18 at 14:07
  • @JimB we are comparing C and Go, not Cgo. – Changkun Sep 12 '18 at 14:10
  • 1
    And how many times did you run your benchmarks? How do you know how many times your test processes were swapped out during a test, or switched to another CPU core? – Andrew Henle Sep 12 '18 at 14:10
  • 2
    Maybe because you're comparing apples to pears? – Jabberwocky Sep 12 '18 at 14:15
  • @Jabberwocky Am I? Why? – Changkun Sep 12 '18 at 14:16
  • @ChangkunOu not sure, but maybe the overhead is different? Perhaps you need to clarify the question and add more background information. – Jabberwocky Sep 12 '18 at 14:18
  • @AndrewHenle You should read the benchmark and running script. – Changkun Sep 12 '18 at 14:19
  • @Someprogrammerdude You are saying nothing helpful. The code is provided in the link. You literally can run the benchmark with two lines of command (`git clone` and `sh run.sh`) – Changkun Sep 12 '18 at 14:21
  • @Jabberwocky I added more background information regarding C system call and Go syscall in the description. – Changkun Sep 12 '18 at 14:24
  • 1
    *You should read the benchmark and running script.* I did. So tell me, how many times did your benchmark processes get swapped out or switched to another CPU while running? If you're going to benchmark things like this, you have to account for context switches and everything else that can pollute your results. – Andrew Henle Sep 12 '18 at 14:32
  • @AndrewHenle _how many times did your benchmark processes get swapped out or switched to another CPU while running?_ You can profile the benchmark yourself. If I can do anything myself, why am I seeking insights here? The final result is an average based on `500000` executions, as you can see from the output of the benchmark. The average shows the C system call is significantly faster than the Go syscall. – Changkun Sep 12 '18 at 14:38
  • 3
    Links can go stale, change, or disappear completely. That's why questions should be *self contained*. Otherwise they run the risk of becoming worthless. It's not only about helping you now, but helping other programmers with similar (or the same) problem in the future as well. If you need, then please (re-)read about [how to ask good questions](http://stackoverflow.com/help/how-to-ask), as well as [this question checklist](https://codeblog.jonskeet.uk/2012/11/24/stack-overflow-question-checklist/). – Some programmer dude Sep 12 '18 at 14:43
  • @Someprogrammerdude Thank you very much. I've added all benchmarks in the question description, please help, thanks again. – Changkun Sep 12 '18 at 14:46
  • 2
    *You can profiling the benchmark your self.* I'm not the one **asking** for help. The information you have given does not "shows C system call is significantly faster than Go syscall" at all. That may be true, but your data doesn't establish that. Something that fast can be perturbed by a lot of things the data you have collected doesn't account for. You can try to account for unknowns by running your benchmarks many times to see if the results are significant and consistent. – Andrew Henle Sep 12 '18 at 15:01
  • @AndrewHenle Well. Chart is added. – Changkun Sep 12 '18 at 15:32

1 Answer


My benchmark shows that a pure C system call is 15.82% faster than a pure Go system call in the latest release (go1.11).

What did I miss? What could be a reason and how to optimize them?

The reason is that while both C and Go (on a typical platform Go supports—such as Linux or *BSD or Windows) are compiled down to machine code, Go-native code runs in an environment quite different from that of C.

The two chief differences to C are:

  • Go code runs in the context of so-called goroutines which are freely scheduled by the Go runtime on different OS threads.
  • Goroutines use their own (growable and reallocatable) lightweight stacks which have nothing to do with the OS-supplied stack C code uses.

So, when Go code wants to make a syscall, quite a lot should happen:

  1. The goroutine which is about to enter a syscall must be "pinned" to the OS thread on which it's currently running.
  2. The execution must be switched to use the OS-supplied C stack.
  3. The necessary preparations are made in the Go runtime's scheduler.
  4. The goroutine enters the syscall.
  5. Upon exit, the goroutine's execution has to be resumed, which is a relatively involved process in itself. It may be additionally hampered if the goroutine stayed in the syscall for too long and the scheduler removed the so-called "processor" from under that goroutine, spawned another OS thread, and made that processor run another goroutine ("processors", or Ps, are the things which run goroutines on OS threads).

Update to answer the OP's comment

<…> Thus there is no way to optimize and I must suffer that if I make massive IO calls, mustn't I?

It heavily depends on the nature of the "massive I/O" you're after.

If your example (with socketpair(2)) is not toy, there is simply no reason to use syscalls directly: the FDs returned by socketpair(2) are "pollable" and hence the Go runtime may use its native "netpoller" machinery to perform I/O on them. Here is a working code from one of my projects which properly "wraps" FDs produced by socketpair(2) so that they can be used as "regular" sockets (produced by functions from the net standard package):

func socketpair() (net.Conn, net.Conn, error) {
    fds, err := syscall.Socketpair(syscall.AF_LOCAL, syscall.SOCK_STREAM, 0)
    if err != nil {
        return nil, nil, err
    }

    c1, err := fdToFileConn(fds[0])
    if err != nil {
        return nil, nil, err
    }

    c2, err := fdToFileConn(fds[1])
    if err != nil {
        c1.Close()
        return nil, nil, err
    }

    return c1, c2, nil
}

func fdToFileConn(fd int) (net.Conn, error) {
    f := os.NewFile(uintptr(fd), "")
    defer f.Close()
    return net.FileConn(f)
}
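A minimal usage sketch of that wrapper (the socketpair and fdToFileConn bodies are repeated here so the snippet is self-contained): because the FDs end up registered with the netpoller, a blocked Read parks only the goroutine, not the underlying OS thread.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
)

// socketpair wraps the FDs from socketpair(2) into net.Conns,
// same technique as in the answer above.
func socketpair() (net.Conn, net.Conn, error) {
	fds, err := syscall.Socketpair(syscall.AF_LOCAL, syscall.SOCK_STREAM, 0)
	if err != nil {
		return nil, nil, err
	}
	c1, err := fdToFileConn(fds[0])
	if err != nil {
		return nil, nil, err
	}
	c2, err := fdToFileConn(fds[1])
	if err != nil {
		c1.Close()
		return nil, nil, err
	}
	return c1, c2, nil
}

func fdToFileConn(fd int) (net.Conn, error) {
	f := os.NewFile(uintptr(fd), "")
	defer f.Close() // net.FileConn dup'ed the FD; the original can be closed
	return net.FileConn(f)
}

func main() {
	c1, c2, err := socketpair()
	if err != nil {
		panic(err)
	}
	defer c1.Close()
	defer c2.Close()

	// The Write and Read below are scheduled through the netpoller.
	go c1.Write([]byte("hello, world!"))

	buf := make([]byte, 13)
	n, err := c2.Read(buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes: %q\n", n, buf[:n])
}
```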

If you're talking about some other sort of I/O, the answer is that yes, syscalls are not really cheap, and if you must do lots of them, there are ways to work around their cost, such as offloading to some C code (linked in, or hooked up as an external process) which batches them, so that each call into that C code results in several syscalls done on the C side.


See also.

  • A great explanation! I appreciate a lot. Thus there is no way to optimize and I must suffer that if I make massive IO calls, mustn't I? – Changkun Sep 12 '18 at 15:43
  • Updated my answer to address your comment. – kostix Sep 12 '18 at 16:05
  • Hi, I managed your proposal, however, the benchmark shows the performance not even close to pure system calls (in description updates). Is there anything wrong with my benchmark? – Changkun Sep 17 '18 at 06:11
  • Both the question and answer already accreted too much additional information, clearly displaying SO is not fit for solving it. Please post the question to [the mailing list](https://groups.google.com/d/forum/golang-nuts) instead. Please be sure to post _the question_ (complete with a short summary of the insight provided by others, and your concerns left unsolved), not just drop a link to this discussion. Thanks. – kostix Sep 17 '18 at 17:41