
Update

With the added pthreaded C client, the problem is reproduced, indicating that the long connection times come from the TCP implementation itself rather than from any particular language's networking library. Altering that behavior does not appear to be readily possible.

Initial Question

I believe my question is largely: What does the Golang net package do when attempting to connect to a server over TCP and:

  1. The server has no connection slots available, not even in the backlog.
  2. The connection is not refused/failed.

There appears to be a large amount of overhead in establishing that connection, with server response times ramping up from ~5 ms to several seconds. This was seen both in a production environment and in the minimal example below. The proper solution is to use a connection pool to the server, which will be implemented; this question is largely out of curiosity.

To reproduce:

  1. Run server with backlog = 1, run client.go.
    • All 50 goroutines fire at once, with a total completion time of almost 2 minutes.
  2. Run server with backlog = 100, run client.go.
    • All 50 goroutines fire at once, queue up connected to the server, and complete in ~260 ms.
  3. Running three C clients that retried every 50 µs completed their connections in ~12 ms on average, so they did not hit this issue.

Example output for backlog = 1 (first time is time to dial, second is time to completion):

user@computer ~/tcp-tests $ go run client.go 127.0.0.1:46999
Long Elapsed Time: 216.579µs, 315.196µs
Long Elapsed Time: 274.169µs, 5.970873ms
Long Elapsed Time: 74.4µs, 10.753871ms
Long Elapsed Time: 590.965µs, 205.851066ms
Long Elapsed Time: 1.029287689s, 1.029574065s
Long Elapsed Time: 1.02945649s, 1.035098229s
...
Long Elapsed Time: 3.045881865s, 6.378597166s
Long Elapsed Time: 3.045314838s, 6.383783688s
Time taken stats: 2.85 +/- 1.59 s // average +/- STDEV
Main Taken: 6.384677948s

Example output for backlog = 100:

...
Long Elapsed Time: 330.098µs, 251.004077ms
Long Elapsed Time: 298.146µs, 256.187795ms
Long Elapsed Time: 315.832µs, 261.523685ms
Time taken stats: 0.13 +/- 0.08 s
Main Taken: 263.186955ms

So what's going on under the hood of net.DialTCP (we used other flavors of dial as well, with no discernible difference) that causes the dial time to grow?

  • Polling time between attempts to make a connection?
  • A global congestion-control variable (per RFC 5681, perhaps behind a mutex?) that gets incremented on each of the initial failed connection attempts?
  • Something else?

I'm leaning towards the first two, as the 1 s, 3 s, and 5 s values look like magic numbers. They show up both on my modest local machine and in a large-scale production environment.
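
This isn't a fix for whatever retry mechanism is at work, but for reference, the dial time can at least be bounded from the Go side with a net.Dialer timeout. A minimal sketch (the 250 ms cap is an arbitrary choice, not a value from the original client):

package main

import (
    "fmt"
    "net"
    "time"
)

// dialWithCap caps the time spent dialing. This does not change any
// underlying SYN retry behaviour; it only turns a multi-second stall into a
// quick error that the caller can handle or retry.
func dialWithCap(addr string) (net.Conn, error) {
    d := net.Dialer{Timeout: 250 * time.Millisecond} // arbitrary cap
    c, err := d.Dial("tcp", addr)
    if err != nil {
        // A timeout here most likely means the SYN (and its retransmissions)
        // went unanswered, e.g. because the listen backlog was full.
        return nil, fmt.Errorf("dial %s: %w", addr, err)
    }
    return c, nil
}

func main() {
    c, err := dialWithCap("127.0.0.1:46999")
    if err != nil {
        fmt.Println(err)
        return
    }
    c.Close()
}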

Here is the minimal server written in C. The configuration value of interest is the backlog argument to listen.

/*
    Adapted from
    https://www.geeksforgeeks.org/tcp-server-client-implementation-in-c/
    
    Compile and run with:
        gcc server.c -o server; ./server
*/
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    int socket_desc, client_sock;
    socklen_t client_size;
    struct sockaddr_in server_addr, client_addr;
    char server_message[2000], client_message[2000];

    // Clean buffers:
    memset(server_message, '\0', sizeof(server_message));
    memset(client_message, '\0', sizeof(client_message));

    // Create socket:
    socket_desc = socket(AF_INET, SOCK_STREAM, 0);

    if(socket_desc < 0){
        printf("Error while creating socket\n");
        return -1;
    }
    printf("Socket created successfully\n");

    // Set port and IP:
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(46999);
    server_addr.sin_addr.s_addr = inet_addr("127.0.0.1");

    // Bind to the set port and IP:
    if(bind(socket_desc, (struct sockaddr*)&server_addr, sizeof(server_addr))<0){
        printf("Couldn't bind to the port\n");
        return -1;
    }
    printf("Done with binding\n");

    // Listen for clients:
    // Increasing the backlog allows the Go client to connect and wait
    // rather than poll/retry.
    if(listen(socket_desc, 100) < 0){
        printf("Error while listening\n");
        return -1;
    }
    printf("\nListening for incoming connections.....\n");

    // Accept an incoming connection:
    client_size = sizeof(client_addr);
    int server_run = 1;
    do
    {
        struct timeval start, end;
        double cpu_time_used;
        gettimeofday(&start, NULL);
        client_sock = accept(socket_desc, (struct sockaddr*)&client_addr, &client_size);

        if (client_sock < 0){
            printf("Can't accept\n");
            return -1;
        }

        // Receive client's message (and null-terminate it so strcmp below is safe):
        ssize_t bytes_received = recv(client_sock, client_message, sizeof(client_message) - 1, 0);
        if (bytes_received < 0){
            printf("Couldn't receive\n");
            return -1;
        }
        client_message[bytes_received] = '\0';
        if (strcmp(client_message, "stop") == 0)
        {
            server_run = 0;
            printf("Received stop message.\n");
        }

        // Respond to client:
        strcpy(server_message, "This is the server's message.");

        if (send(client_sock, server_message, strlen(server_message), 0) < 0){
            printf("Can't send\n");
            return -1;
        }

        // sleep for 5 ms
        usleep(5000);

        // Closing the socket:
        close(client_sock);
        gettimeofday(&end, NULL);
        cpu_time_used = (end.tv_usec - start.tv_usec) / 1000.0;
        if (cpu_time_used > 0.0) // overflow in tv_usec if negative
            printf("Server Time: %.4f ms\n", cpu_time_used);
    } while(server_run);

    close(socket_desc);

    return 0;
}

Here is the Go client used for testing:

/*
    Adapted from
    https://www.linode.com/docs/guides/developing-udp-and-tcp-clients-and-servers-in-go/

    Run once the server.c is compiled and running with:
        go run client.go 127.0.0.1:46999
*/
package main

import (
    "fmt"
    "net"
    "os"
    "sync"
    "time"

    "github.com/montanaflynn/stats"
)

func do_message(wg *sync.WaitGroup, connect string, time_taken *float64) {
    defer wg.Done()
    message := make([]byte, 128)
    start_time := time.Now()
    pAddr, err := net.ResolveTCPAddr("tcp", connect)
    if err != nil {
        return
    }

    c, err := net.DialTCP("tcp", nil, pAddr)
    if err != nil {
        fmt.Println(err)
        return
    }
    c.SetLinger(0)
    dialed_time := time.Since(start_time)

    defer func() {
        c.Close()
        elapsed_time := time.Since(start_time)
        if elapsed_time.Microseconds() > 60 { // microseconds
            fmt.Println("Long Elapsed Time: " + dialed_time.String() + ", " + elapsed_time.String())
        }
        *time_taken = float64(elapsed_time.Microseconds())
    }()

    text := "{\"service\": \"magic_service_str\"}"
    c.Write([]byte(text))
    _, _ = c.Read(message) // Does not actually wait for response.
}
func main() {
    main_start := time.Now()
    arguments := os.Args
    if len(arguments) == 1 {
        fmt.Println("Please provide host:port.")
        return
    }
    n_messages := 50
    wg := new(sync.WaitGroup)
    wg.Add(n_messages)
    times := make([]float64, n_messages)
    for i := 0; i < n_messages; i++ {
        // Used to turn the goroutines into serial implementation
        // time.Sleep(5500 * time.Microsecond)
        go do_message(wg, arguments[1], &times[i])
    }
    wg.Wait()
    avg, _ := stats.Mean(times)
    std, _ := stats.StandardDeviation(times)
    fmt.Println("Time taken stats: " + fmt.Sprintf("%.2f", avg / 1000000.0) + " +/- " + fmt.Sprintf("%.2f", std / 1000000.0) + " s")
    main_taken := time.Since(main_start)
    fmt.Println("Main Taken: " + main_taken.String())
}

Here is the updated pthreaded C client, which confirmed the issue is not specific to the Golang implementation:

// gcc client_p.c -o pclient -lpthread
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>


#include <pthread.h>
#include <errno.h>

#ifndef THREAD_LOOP_COUNT
#define THREAD_LOOP_COUNT 1
#endif

/* Subtract the ‘struct timeval’ values X and Y,
   storing the result in RESULT.
   Return 1 if the difference is negative, otherwise 0.
   https://www.gnu.org/software/libc/manual/html_node/Calculating-Elapsed-Time.html
*/

int
timeval_subtract (struct timeval *result, struct timeval *x, struct timeval *y)
{
  /* Perform the carry for the later subtraction by updating y. */
  if (x->tv_usec < y->tv_usec) {
    int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
    y->tv_usec -= 1000000 * nsec;
    y->tv_sec += nsec;
  }
  if (x->tv_usec - y->tv_usec > 1000000) {
    int nsec = (x->tv_usec - y->tv_usec) / 1000000;
    y->tv_usec += 1000000 * nsec;
    y->tv_sec -= nsec;
  }

  /* Compute the time remaining to wait.
     tv_usec is certainly positive. */
  result->tv_sec = x->tv_sec - y->tv_sec;
  result->tv_usec = x->tv_usec - y->tv_usec;

  /* Return 1 if result is negative. */
  return x->tv_sec < y->tv_sec;
}


static void* workerThreadFunc(void* arg)
{
    int socket_desc;
    struct sockaddr_in server_addr;
    char server_message[2000], client_message[2000];
    // Clean buffers:
    memset(server_message,'\0',sizeof(server_message));
    memset(client_message,'\0',sizeof(client_message));
    // Set port and IP the same as server-side:
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(46999);
    server_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
    int retries = 0;
    struct timeval start, end, difference;
    double cpu_time_used;
    for(int i = 0; i < THREAD_LOOP_COUNT; i++)
    {
        gettimeofday(&start, NULL);
        // Create socket:
        socket_desc = socket(AF_INET, SOCK_STREAM, 0);
        if(socket_desc < 0){
            printf("Unable to create socket\n");
            return NULL;
        }
        // Send connection request to server:
        while(connect(socket_desc, (struct sockaddr*)&server_addr, sizeof(server_addr)) < 0){
            retries++;
            if (retries > 10)
            {
                printf("Unable to connect\n");
                retries = 0;
            }
            usleep(5);
        }
        retries = 0;

        // Send the message to server:
        if(send(socket_desc, client_message, strlen("client message."), 0) < 0){
            printf("Unable to send message\n");
            close(socket_desc);
            return NULL;
        }

        // Receive the server's response:
        if(recv(socket_desc, server_message, sizeof(server_message), 0) < 0){
            printf("Error while receiving server's msg\n");
            close(socket_desc);
            return NULL;
        }

        // Close the socket:
        close(socket_desc);
        gettimeofday(&end, NULL);
        timeval_subtract (&difference, &end, &start);
        cpu_time_used = (double)difference.tv_sec + (double)difference.tv_usec / 1000000.0;
        printf("Client Time: %.4e s\n", cpu_time_used);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int n_threads = 50;  // default value
    if (argc > 1)
        n_threads = atoi(argv[1]);

    pthread_t *threads = (pthread_t*)malloc(n_threads * sizeof(pthread_t));

    struct timeval start, end, difference;
    gettimeofday(&start, NULL);
    for(int i = 0; i < n_threads; i++)
    {
        int createRet = pthread_create(&threads[i], NULL, workerThreadFunc, NULL);
        if (createRet != 0)
        {
            printf("failed to create thread\n");
        }
    }
    for(int i = 0; i < n_threads; i++)
        pthread_join(threads[i], NULL);
    gettimeofday(&end, NULL);
    timeval_subtract (&difference, &end, &start);
    double cpu_time_used = (double)difference.tv_sec + (double)difference.tv_usec / 1000000.0;
    printf("Total Client Time: %.4e s\n", cpu_time_used);
    free(threads);
    return 0;
}
  • It has nothing to do with Golang. The behaviour is implemented in the kernel at the server side, and it is to either reset or ignore the connection depending on whether it is or isn't Windows. That in turn will cause the client `connect()` call to time out and retry with double the timeout, up to 3 attempts. In total this can take a minute or two. The real solution is to speed up the server's `accept()` loop and make it independent of all I/O with accepted clients. – user207421 Apr 17 '22 at 07:23
  • @user207421 So is the kernel on the server side retrying a connection to the server process, and that gets timed out with a doubling? – bbg Apr 17 '22 at 15:08
  • It is the client side that does the retries. I already said that. – user207421 Apr 18 '22 at 00:24
  • @user207421 Right, but you also said it had nothing to do with Golang, which is the client side. Is there a way to alter that exponential backoff via Golang/C on the client side? – bbg Apr 18 '22 at 16:49
  • The client side also consists of the C library, an operating system, a computer, and a network interface. The retries I mentioned occur within the C library `connect()` call. I am not aware of any way to control them. – user207421 Apr 20 '22 at 01:21

1 Answer


As indicated by @user207421, the issue lies in the TCP implementation, which applies an exponential backoff to connection retries; with a timeout that starts around 1 s and doubles on each attempt, that is consistent with the ~1 s and ~3 s dial times in the output above. Neither Golang nor C appears to offer an easy way to alter this behavior.

The answer is: don't open and close TCP connections per request if you need high throughput; use a connection pool, as sketched below.
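
A minimal sketch of what such a pool could look like in Go (the type, names, and sizes are illustrative rather than taken from any real code, and it assumes a server that keeps connections open, unlike the one-shot example server above):

package pool

import (
    "net"
    "time"
)

// Pool is a hypothetical fixed-size connection pool: connections are dialed
// once up front and then reused, so the backlog race and any connect retry
// backoff are paid at most once, at startup.
type Pool struct {
    conns chan net.Conn
}

func New(addr string, size int) (*Pool, error) {
    p := &Pool{conns: make(chan net.Conn, size)}
    for i := 0; i < size; i++ {
        c, err := net.DialTimeout("tcp", addr, 2*time.Second)
        if err != nil {
            return nil, err
        }
        p.conns <- c
    }
    return p, nil
}

// Get blocks until a connection is free; Put hands it back for reuse.
func (p *Pool) Get() net.Conn  { return <-p.conns }
func (p *Pool) Put(c net.Conn) { p.conns <- c }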

There has been some work on removing the exponential backoff from TCP, linked below, but for specific cases there is likely a better solution; there certainly was for me.

ACM SIGCOMM Computer Communication Review, "Removing Exponential Backoff from TCP", Volume 38, Number 5, October 2008.
