
I have a client-server program written in C. The intent is to see how fast big data can be transported over TCP. The receiving side's OS (Ubuntu Linux 14.*) is tuned to improve TCP performance, as per the documentation around TCP / socket / window scaling etc., as below:

net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216

Apart from this, I have also increased the individual socket buffer size through a setsockopt call.
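
For reference, a minimal sketch of the kind of setsockopt/getsockopt check involved (the helper name is my own; note that Linux reports back roughly double the requested value, because the kernel accounts for its bookkeeping overhead as well, and that the buffer size has to be set before connect()/listen() to influence the window scaling negotiated on the handshake):

#include <stdio.h>
#include <sys/socket.h>

/* Request a receive buffer of the given size and print what the kernel
   actually granted.  Linux doubles the requested value to leave room for
   bookkeeping overhead, so the readback is larger than the request. */
static void set_and_check_rcvbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == -1)
        perror("setsockopt(SO_RCVBUF)");

    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0)
        printf("effective SO_RCVBUF: %d bytes\n", granted);
}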

But the program does not seem to respond to these changes: the overall throughput is either flat or even reduced at times. When I take a tcpdump at the receiving side, I see a monotonous pattern of TCP packets of length 1368 arriving, in most (99%) cases.

19:26:06.531968 IP <SRC> > <DEST>: Flags [.], seq 25993:27361, ack 63, win 57, options [nop,nop,TS val 196975830 ecr 488095483], length 1368

As per the documentation, the TCP window scaling option increases the receive window size in proportion to the demand and capacity, but all I see is "win 57": very few bytes apparently remaining in the receive buffer, which does not match my expectation.
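
To cross-check the scaling from within the program, here is a minimal sketch (helper name my own) using Linux's TCP_INFO; note that for data segments tcpdump prints the raw 16-bit window field, so the real advertised window is that value shifted left by the receiver's scale factor:

#include <stdio.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Print the window-scale factors negotiated during the TCP handshake
   of a connected socket. */
static void print_window_scale(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
        printf("snd_wscale=%d rcv_wscale=%d\n",
               ti.tcpi_snd_wscale, ti.tcpi_rcv_wscale);
    else
        perror("getsockopt(TCP_INFO)");
}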

Hence I have started suspecting my assumptions about the tuning itself, and have these questions:

  1. Are there any specific tunables required at the sending side to improve reception at the client side? Is making sure that the program writes the whole chunk of data in one go not enough?

  2. Are the client-side tunables mentioned above necessary and sufficient? The defaults in the system are too low, but I don't see the changes applied in /etc/sysctl.conf having any effect. Is running sysctl --system after the changes sufficient for them to take effect, or do we need to reboot the system? (See the small check sketched after this list.)

  3. If the OS is a virtual machine, will these tunables be fully effective, or are there additional steps needed on the real physical machine?
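
For question 2, here is a minimal sketch of how the live values can be checked from C; it just reads /proc/sys, which reflects what the kernel is currently using regardless of what /etc/sysctl.conf contains:

#include <stdio.h>

/* Print the values the kernel is using right now.  If they match the
   tuned values, the sysctl changes are in effect without a reboot. */
int main(void)
{
    const char *files[] = {
        "/proc/sys/net/ipv4/tcp_window_scaling",
        "/proc/sys/net/core/rmem_max",
        "/proc/sys/net/ipv4/tcp_rmem",
    };

    for (int i = 0; i < 3; i++) {
        FILE *f = fopen(files[i], "r");
        char line[256];
        if (f == NULL)
            continue;
        if (fgets(line, sizeof(line), f))
            printf("%s: %s", files[i], line);
        fclose(f);
    }
    return 0;
}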

I can share the source code if that helps, but I can guarantee that it is just trivial code.

Here is the code:

#cat client.c 

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <string.h>

#define size (1024 * 1024 * 32)

int main() {

  int s;
  static char buffer[size];   /* static: a 32 MB array would not fit on the default stack */
  struct sockaddr_in sa;
  socklen_t addr_size;

  s = socket(PF_INET, SOCK_STREAM, 0);

  sa.sin_family = AF_INET;
  sa.sin_port = htons(25000);
  sa.sin_addr.s_addr = inet_addr("<SERVERIP>");
  memset(sa.sin_zero, '\0', sizeof sa.sin_zero);

  /* Enlarge the receive buffer before connect(): per tcp(7), the socket
     buffer size must be set before connect()/listen() to influence the
     window scaling negotiated during the handshake. */
  int rbl = 1048576;
  int g = setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rbl, sizeof(rbl));

  addr_size = sizeof sa;
  connect(s, (struct sockaddr *) &sa, addr_size);

  /* Drain the connection until the server closes it or an error occurs. */
  while (1) {
    int ret = read(s, buffer, size);
    if (ret <= 0) break;
  }
  return 0;
}

And the server code:

bash-4.1$ cat server.c

#include <sys/types.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

#define size (32 * 1024 * 1024)

int main() {

  int fdsocket;
  struct sockaddr_in sock;

  fdsocket = socket(AF_INET, SOCK_STREAM, 0);

  /* Enlarge the send buffer on the listening socket; accepted sockets inherit it. */
  int rbl = 1048576;
  int g = setsockopt(fdsocket, SOL_SOCKET, SO_SNDBUF, &rbl, sizeof(rbl));

  sock.sin_family = AF_INET;
  sock.sin_addr.s_addr = inet_addr("<SERVERIP>");
  sock.sin_port = htons(25000);
  memset(sock.sin_zero, '\0', sizeof sock.sin_zero);

  g = bind(fdsocket, (struct sockaddr *) &sock, sizeof(sock));
  if (g == -1) {
    fprintf(stderr, "bind error: %d\n", errno);
    exit(1);
  }
  int p = listen(fdsocket, 1);

  /* 32 MB of anonymous memory, filled with a fixed pattern, as the payload. */
  char *buffer = (char *) mmap(NULL, size, PROT_WRITE | PROT_READ,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buffer == MAP_FAILED) {
    fprintf(stderr, "%d\n", errno);
    exit(-1);
  }
  memset(buffer, 0xc, size);

  int connfd = accept(fdsocket, (struct sockaddr *) NULL, NULL);
  rbl = 1048576;
  g = setsockopt(connfd, SOL_SOCKET, SO_SNDBUF, &rbl, sizeof(rbl));

  int wr = write(connfd, buffer, size);
  close(connfd);
  return 0;
}
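
In case the single write() call turns out to matter: below is a minimal sketch of a send loop (helper name my own) that retries until the whole buffer has been handed to the kernel, since write() on a socket may report a short count, for example when interrupted by a signal. The server above could call it instead of the single write().

#include <errno.h>
#include <unistd.h>

/* Keep calling write() until every byte has been handed to the kernel,
   retrying on EINTR and reporting genuine errors with -1. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t) n;
    }
    return (ssize_t) done;
}
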
Gireesh Punathil
  • In order to get a helpful answer you are going to need to provide more details about your setup and what problems you are experiencing. For example "I have two machines with 10gbit/s nics connected to two 10gbit/s switch ports on the same switch, but I can't seem to get more than 5gbit/s throughput" will have very different answers from "I have a client and a server connected via a wan with a 600ms RTT. Is there anything I can do to improve throughput." – JimD. Jan 30 '17 at 06:00
  • thanks for the comment - While all such environmental factors remain the same, the changes in OS tunables have no effect - can't this in itself be considered a problem? – Gireesh Punathil Jan 30 '17 at 06:48
  • It appears that you are trying to increase the window size of the receiver in an attempt to speed up the transfer. Did this even cause the receiver to advertise a larger window size? This would only help if the window size is a limiting factor. If you had a network capture before you made the changes and you looked at it with Wireshark filter tcp.analysis.flags and you saw zero window advertisements, then, yes, increasing the window size will help. Best thing you can do is get a network capture and look at the IO graph and tcp.analysis.flags in Wireshark and see what can be improved. – JimD. Jan 30 '17 at 06:57
  • thanks for your suggestion Jim - let me try that out. I was thinking that the "win 57.." messages from the tcpdump essentially covers the same, but I guess wireshark observations can be more comprehensive. thanks once again. – Gireesh Punathil Jan 30 '17 at 07:10
  • The `win 57` is probably unscaled: see [wireshark and tcpdump -r: strange tcp window sizes](http://stackoverflow.com/questions/3254574/wireshark-and-tcpdump-r-strange-tcp-window-sizes). – JimD. Jan 30 '17 at 07:11
  • What's the return value for your server's `write()` call? Did it really send the entire 32 MB? (Note also that `write()` returns `ssize_t`, not `int`. They are *not* the same.) Also, how do you know your `setsockopt()` calls actually worked to increase your socket buffer size? You ignore the return value. – Andrew Henle Jan 30 '17 at 12:29
  • Andrew, I see ssize_t originates thus: typedef long int __ssize_t; and in 32 bit system int and ssize_t are same. In terms of socket buffer size - actually live debugged the code to make sure that each call goes through. – Gireesh Punathil Jan 30 '17 at 16:42
  • thanks Jim!! that was a great and useful finding. Yes, I checked the tcpdump file, and found out that I have window scaling enabled with a factor of 9! so these numbers (win 57) are quite big - 29184 bytes I guess. So that addresses a great deal of my original problem, now it is all about seeing the performance gain in a consistent manner. I will continue to see if I miss something, but any further suggestions are welcome! – Gireesh Punathil Jan 30 '17 at 16:46
  • @GireeshPunathil *I see ssize_t originates thus: typedef long int __ssize_t; and in 32 bit system int and ssize_t are same.* I see you have no experience writing either portable or 64-bit code. If you add something like `-m64` to your *completely separate and unrelated* compiler options, your code *breaks*. Again, `ssize_t` and `int` are not the same. There's a reason why the standard specifies `ssize_t` and not `int` or `long int`. Ignore it, and you are writing at best non-portable and at worst broken and buggy code. – Andrew Henle Feb 01 '17 at 01:40
  • @AndrewHenle - (1) fair enough, since your last post I changed that line to ssize_t. (2) As the write is on a blocking socket, there is no reason for it to return prematurely under normal circumstances. (3) I have no reason / evidence that suggests the write side is a problem. The main theme of my question is centered around what the client can do to receive data faster than the current pace - in 2 forms: (a) at the source level, while the environmental factors including n/w, os etc. remain the same, (b) what additional tunables can be applied at the OS to complement the intention of the source and the use case. – Gireesh Punathil Feb 01 '17 at 05:46

1 Answer

  1. There are many tunables, but whether they have an effect, and whether the effect is positive or negative, also depends on the situation. What are the defaults for the tunables? The values you set might actually be lower than the defaults on your OS, thereby decreasing performance. But larger buffers can also be detrimental, because more RAM is used and it might no longer fit into cache memory. It also depends on your network itself: is it wired or wireless, how many hops are there, what kind of routers are in between? But sending data in as large chunks as possible is usually the right thing to do.

    One tunable you have missed is the congestion control algorithm, which you can tune with net.ipv4.tcp_congestion_control. Which ones are available depends on your kernel, and which one is best depends on your network and the kind of traffic that you are sending. (A per-socket way to select it is sketched after this list.)

    Another thing is that TCP has two endpoints, and tunables on both sides are important.

  2. Changes made with sysctl take effect immediately for new TCP connections; no reboot is needed.

  3. The TCP parameters only have an effect on the endpoints of a TCP connection, so you don't have to change them on the VM host. But running in a guest means that the packets it sends still need to be processed by the host in some way (if only to forward them to the real physical network interface). Running your test from inside a virtual machine will always be slower than running it on a physical machine.
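
To illustrate the congestion-control point in 1.: besides the system-wide sysctl, Linux also lets you pick the algorithm per socket. A minimal sketch ("cubic" is only an example; whatever name you pass must appear in /proc/sys/net/ipv4/tcp_available_congestion_control):

#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Select a congestion control algorithm for one socket only, then read
   back the algorithm actually in use. */
static void set_congestion_control(int fd, const char *algo)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) == -1)
        perror("setsockopt(TCP_CONGESTION)");

    char in_use[16] = "";
    socklen_t len = sizeof(in_use);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, in_use, &len) == 0)
        printf("congestion control in use: %s\n", in_use);
}

Usage would be something like set_congestion_control(connfd, "cubic") right after accept() on the server.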

What I'm missing are benchmark numbers that you can compare with the actual network speed. Is there room for improvement at all? Maybe you are already at the maximum speed that is possible? In that case no amount of tuning will help. Note that the defaults are normally very reasonable.
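
If it helps, here is a minimal sketch (helper name my own) of how to obtain such a number on the client side: time the receive loop with a monotonic clock and divide the byte count by the elapsed time, then compare the result with what the link should be able to deliver.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Drain the socket while measuring elapsed time, then report throughput. */
static void time_receive(int fd, char *buf, size_t bufsize)
{
    struct timespec t0, t1;
    long long total = 0;
    ssize_t n;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, bufsize)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.3f s = %.2f MB/s\n", total, secs, total / secs / 1e6);
}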

G. Sliepen
  • thanks Sliepen for the useful info - I will wait a few days before I accept your answer. In terms of benchmarks - "win 57" and the packet size 1368 in the tcpdump make it obvious that it is not running optimally. In terms of the virtualness of the OS - agreed, but my comparisons with and without the tunables use the same host, so they should show some difference, which they do not, which is the cause of my worry. – Gireesh Punathil Jan 29 '17 at 12:49
  • I can't tell from just the window and packet size whether your tunables are optimal. Perhaps you can also show us your code? – G. Sliepen Jan 29 '17 at 14:16
  • thanks - as I told you, it contains nothing more than standard client-server code. The two extra things are: i) a write loop for the server and a read loop for the client, ii) a setsockopt on the client socket to increase the buffer size. My main concern remains: a packet size of 1368 looks far too small for the options I set and the socket I created. – Gireesh Punathil Jan 30 '17 at 05:09
  • Nevertheless, I have added my client server code in the question. – Gireesh Punathil Jan 30 '17 at 05:50