
I have an Ubuntu 16.04 server on a LAN with several dozen machines that need to read/write to it via Samba shares. It was running a single gigabit card, but I decided to try bonding to improve the overall transfer rates in and out of the server. I installed four 1-gigabit cards and successfully configured a bond0 interface with the following:

#> cat /etc/network/interfaces

# The loopback network interface
auto lo
iface lo inet loopback

auto enp1s6f0
iface enp1s6f0 inet manual
bond-master bond0

auto enp1s6f1
iface enp1s6f1 inet manual
bond-master bond0

auto enp1s7f0
iface enp1s7f0 inet manual
bond-master bond0

auto enp1s7f1
iface enp1s7f1 inet manual
bond-master bond0


# The primary network interface
auto bond0
iface bond0 inet static
address 192.168.111.8
netmask 255.255.255.0
network 192.168.111.0
broadcast 192.168.111.255
gateway 192.168.111.1
dns-nameservers 192.168.111.11
bond-mode 6
bond-miimon 100
bond-lacp-rate 1
bond-slaves enp1s6f0 enp1s6f1 enp1s7f0 enp1s7f1

#> ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s6f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 00:09:6b:1a:03:6c brd ff:ff:ff:ff:ff:ff
3: enp1s6f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 00:09:6b:1a:03:6d brd ff:ff:ff:ff:ff:ff
4: enp1s7f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 00:09:6b:1a:01:ba brd ff:ff:ff:ff:ff:ff
5: enp1s7f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 00:09:6b:1a:01:bb brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:09:6b:1a:03:6d brd ff:ff:ff:ff:ff:ff
    inet 192.168.111.8/24 brd 192.168.111.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::209:6bff:fe1a:36d/64 scope link
       valid_lft forever preferred_lft forever

#> ifconfig

bond0     Link encap:Ethernet  HWaddr 00:09:6b:1a:03:6d
          inet addr:192.168.111.8  Bcast:192.168.111.255  Mask:255.255.255.0
          inet6 addr: fe80::209:6bff:fe1a:36d/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:30848499 errors:0 dropped:45514 overruns:0 frame:0
          TX packets:145615150 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3344795597 (3.3 GB)  TX bytes:407934338759 (407.9 GB)

enp1s6f0  Link encap:Ethernet  HWaddr 00:09:6b:1a:03:6c
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:7260526 errors:0 dropped:15171 overruns:0 frame:0
          TX packets:36216191 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:453705851 (453.7 MB)  TX bytes:101299060589 (101.2 GB)

enp1s6f1  Link encap:Ethernet  HWaddr 00:09:6b:1a:03:6d
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:8355652 errors:0 dropped:0 overruns:0 frame:0
          TX packets:38404078 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:513634676 (513.6 MB)  TX bytes:107762014012 (107.7 GB)

enp1s7f0  Link encap:Ethernet  HWaddr 00:09:6b:1a:01:ba
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:6140007 errors:0 dropped:15171 overruns:0 frame:0
          TX packets:36550756 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:382222165 (382.2 MB)  TX bytes:102450666514 (102.4 GB)

enp1s7f1  Link encap:Ethernet  HWaddr 00:09:6b:1a:01:bb
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:9092314 errors:0 dropped:15171 overruns:0 frame:0
          TX packets:34444125 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1995232905 (1.9 GB)  TX bytes:96422597644 (96.4 GB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:35 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:2640 (2.6 KB)  TX bytes:2640 (2.6 KB)

Testing transfer rates with 8 Windows machines copying 2 TB of files.

#> iftop -B -i bond0

              25.5MB         50.9MB         76.4MB         102MB     127MB
+-------------------------------------------------------------------------
192.168.111.8           => 192.168.111.186         11.8MB  12.4MB  14.7MB
                        <=                          126KB   124KB   102KB
192.168.111.8           => 192.168.111.181         12.4MB  12.1MB  7.83MB
                        <=                          121KB   105KB  55.1KB
192.168.111.8           => 192.168.111.130         11.5MB  11.0MB  12.6MB
                        <=                          106KB  88.5KB  77.1KB
192.168.111.8           => 192.168.111.172         10.4MB  10.9MB  14.2MB
                        <=                          105KB   100KB  92.2KB
192.168.111.8           => 192.168.111.179         9.76MB  9.86MB  4.20MB
                        <=                          101KB  77.0KB  28.8KB
192.168.111.8           => 192.168.111.182         9.57MB  9.72MB  5.97MB
                        <=                         91.4KB  72.4KB  37.9KB
192.168.111.8           => 192.168.111.161         8.01MB  9.51MB  12.9MB
                        <=                         71.5KB  60.6KB  72.7KB
192.168.111.8           => 192.168.111.165         9.46MB  5.29MB  1.32MB
                        <=                         100.0KB 58.2KB  14.6KB
192.168.111.8           => 192.168.111.11            73B    136B     56B
                        <=                          112B    198B     86B
192.168.111.255         => 192.168.111.132            0B      0B      0B
                        <=                          291B    291B    291B

--------------------------------------------------------------------------
TX:             cum:   3.61GB   peak:   85MB     rates:   83.0MB  80.7MB  73.7MB
RX:                    22.0MB            823KB            823KB   687KB   481KB
TOTAL:                 3.63GB           86.0MB           83.8MB  81.4MB  74.2MB

As you can see in the iftop output, I'm only getting transfer rates around 80 MB/s, which is about the same as I was getting with the single network card. My CPU runs about 90% idle, and the data is being read/written to a 14-drive ZFS pool, so I don't think I have any drive bottlenecks. I don't have any fancy switches, just basic Netgear ProSafe switches like this: http://www.newegg.com/Product/Product.aspx?Item=N82E16833122058 but everything I have read about modes 5 and 6 says that special switches are not needed. I don't need individual connections to exceed 1 Gb/s, but my hope is that the total across all connections could surpass 1 Gb/s. Are there any additional configuration settings I'm missing, or is there some limitation with Samba? If bonding can't do what I want, are there other solutions I can use? Is SMB3 multi-channel production ready?
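For anyone investigating the SMB3 multi-channel route: in the Samba versions of this era (4.4 and later) it is gated behind a single smb.conf option and was explicitly marked experimental upstream, so this is a sketch rather than production advice:

```
# /etc/samba/smb.conf -- multi-channel was experimental in Samba 4.4-era
# releases; test carefully before relying on it.
[global]
    server multi channel support = yes
```

Multi-channel also needs client-side support (Windows 8 / Server 2012 or newer), and the clients themselves need multiple NICs or RSS-capable adapters to open more than one channel.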

Edit: below is the output from the commands Tom asked for.

#> iostat -dx 5

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  489.00   11.80  6400.00    45.60    25.74     0.25    0.49    0.46    1.81   0.30  14.94
sdc               0.00     0.00  476.40   11.40  6432.80    44.00    26.56     0.28    0.57    0.55    1.61   0.32  15.76
sda               0.00     0.00  486.00   11.20  6374.40    43.20    25.81     0.26    0.53    0.50    1.84   0.31  15.36
sdh               0.00     0.00  489.60   13.00  6406.40    50.40    25.69     0.26    0.52    0.48    1.72   0.31  15.38
sdf               0.00     0.00  494.00   12.60  6376.00    48.80    25.36     0.26    0.52    0.49    1.67   0.31  15.88
sdd               0.00     0.00  481.60   12.00  6379.20    46.40    26.04     0.29    0.60    0.57    1.75   0.34  16.68
sde               0.00     0.00  489.80   12.20  6388.00    47.20    25.64     0.30    0.59    0.56    1.82   0.34  16.88
sdg               0.00     0.00  487.40   13.00  6400.80    50.40    25.78     0.27    0.53    0.50    1.75   0.32  16.24
sdj               0.00     0.00  481.40   11.40  6427.20    44.00    26.26     0.28    0.56    0.54    1.74   0.33  16.10
sdi               0.00     0.00  483.80   11.60  6424.00    44.80    26.12     0.26    0.52    0.49    1.67   0.31  15.14
sdk               0.00     0.00  492.60    8.60  6402.40    32.80    25.68     0.25    0.49    0.46    2.28   0.31  15.42
sdm               0.00     0.00  489.80   10.40  6421.60    40.00    25.84     0.25    0.51    0.47    2.23   0.32  16.18
sdn               0.00     0.00  489.60   10.00  6404.80    39.20    25.80     0.24    0.49    0.46    1.92   0.29  14.38
sdl               0.00     0.00  498.40    8.40  6392.00    32.00    25.35     0.25    0.50    0.47    1.93   0.31  15.48
sdo               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

#> zpool iostat -v 5

                                                   capacity     operations    bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
backup                                          28.9T  9.13T    534      0  65.9M      0
  raidz2                                        28.9T  9.13T    534      0  65.9M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHT17HA      -      -    422      0  4.77M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHSRD6A      -      -    413      0  4.79M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHRZWYA      -      -    415      0  4.78M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHSRS2A      -      -    417      0  4.77M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHR2DPA      -      -    397      0  4.83M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHN0P0A      -      -    418      0  4.78M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHU34LA      -      -    419      0  4.76M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHRHUEA      -      -    417      0  4.78M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHM0HBA      -      -    413      0  4.78M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHJG4LA      -      -    410      0  4.79M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHST58A      -      -    417      0  4.78M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHS0G5A      -      -    418      0  4.78M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHN2D4A      -      -    414      0  4.80M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHR2G5A      -      -    417      0  4.79M      0
----------------------------------------------  -----  -----  -----  -----  -----  -----

So I do have several switches in the office, but currently this machine has all four network ports plugged into the same 24 port switch that the client windows machines are connected to, thus all of this traffic should be contained within this switch. Traffic to the internet and our internal DNS would need to go through a link to another switch, but I don't think that should affect this issue.

Edit #2: added some additional info.

#> cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: enp1s6f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp1s6f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:09:6b:1a:03:6d
Slave queue ID: 0

Slave Interface: enp1s6f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:09:6b:1a:03:6c
Slave queue ID: 0

Slave Interface: enp1s7f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:09:6b:1a:01:ba
Slave queue ID: 0

Slave Interface: enp1s7f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:09:6b:1a:01:bb
Slave queue ID: 0

Edit #3

#> zfs list -o name,recordsize,compression

NAME                    RECSIZE  COMPRESS
backup                     128K       off
backup/Accounting          128K       off
backup/Archive             128K       off
backup/Documents           128K       off
backup/Library             128K       off
backup/Media               128K       off
backup/photos              128K       off
backup/Projects            128K       off
backup/Temp                128K       off
backup/Video               128K       off
backup/Zip                 128K       off

Disk read tests. Single-file read:

#> dd if=MasterDynamic_Spray_F1332.tpc of=/dev/null

9708959+1 records in
9708959+1 records out
4970987388 bytes (5.0 GB, 4.6 GiB) copied, 77.755 s, 63.9 MB/s

While the above dd test was running, I pulled a zpool iostat:

#> zpool iostat -v 5

                                                   capacity     operations    bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
backup                                          28.9T  9.07T    515      0  64.0M      0
  raidz2                                        28.9T  9.07T    515      0  64.0M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHT17HA      -      -    413      0  4.62M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHSRD6A      -      -    429      0  4.60M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHRZWYA      -      -    431      0  4.59M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHSRS2A      -      -    430      0  4.59M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHR2DPA      -      -    432      0  4.60M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHN0P0A      -      -    427      0  4.60M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHU34LA      -      -    405      0  4.65M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHRHUEA      -      -    430      0  4.58M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHM0HBA      -      -    431      0  4.58M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHJG4LA      -      -    427      0  4.60M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHST58A      -      -    429      0  4.59M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHS0G5A      -      -    428      0  4.59M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHN2D4A      -      -    427      0  4.60M      0
    ata-Hitachi_HUA723030ALA640_MK0371YVHR2G5A      -      -    428      0  4.59M      0
----------------------------------------------  -----  -----  -----  -----  -----  -----
Ray Collett
  • During a load test, please collect the output of `iostat -dx 5` on the Ubuntu server and show a representative sample (note that you can ignore the first block of output). Also `zpool iostat -v 5`. You mention "switches" (plural) -- what is the topology of the network between the server and clients? (Side note: `bond-lacp-rate` is a bit out of place because this is not an LACP bond) – Tom Shaw Sep 07 '16 at 03:10
  • I have edited to include the iostat and zpool commands, as well as my switch layout. Thanks! – Ray Collett Sep 07 '16 at 17:16
  • Could you verify that the clients are connected at 1Gbit? The output of `iftop` for the clients is pretty close to 100Mbit connection speed. – Thomas Sep 07 '16 at 18:14
  • Confirmed, clients are at 1000Mbit (1Gbit). If I only transfer files with a single client, I easily get 80 - 90 MB/s. Having 8 clients pull at the same time does tend to make their individual stats look a lot like 100Mbit. – Ray Collett Sep 07 '16 at 18:44
  • Thanks for the update. What is the recordsize of your filesystem? Is compression on? (Please provide `zfs list -o name,recordsize,compression` for the filesystem in question) – Tom Shaw Sep 07 '16 at 22:49
  • I've updated my answer based on your edits so far – Tom Shaw Sep 07 '16 at 23:03

2 Answers


The ifconfig output shows that transmit bytes are evenly balanced across all four interfaces, so the bond itself is working as configured.

Based on the iostat output, this looks like a disk IOPS (I/Os per second) bottleneck to me. Each disk is performing around 400-500 IOPS of 12-16 kB each on average. If these I/Os are not sequential, then you're probably hitting the random-I/O limit of the drives. On traditional spinning disks this limit comes from the combination of rotational speed and head-seek time -- a purely random workload on these disks would top out at around 100 IOPS per drive.
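A quick back-of-envelope check against the iostat sample above (figures rounded from the output; a sketch, not a benchmark):

```python
# Per-disk figures taken (rounded) from the iostat -dx output above.
reads_per_sec = 490          # r/s per disk
read_kb_per_sec = 6400       # rkB/s per disk

# Average size of each read I/O -- small, seek-bound requests.
avg_io_kb = read_kb_per_sec / reads_per_sec
print(round(avg_io_kb, 1))   # → 13.1 (kB per read)

# Raw aggregate read bandwidth across all 14 spindles.
drives = 14
aggregate_mb = drives * read_kb_per_sec / 1024
print(round(aggregate_mb, 1))  # → 87.5 (MB/s, same ballpark as the
                               #   ~80 MB/s seen on the wire)
```

The fact that raw disk throughput and network throughput land within a few MB/s of each other is consistent with the disks, not the bond, being the limiting factor.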

This is made worse by the way ZFS handles striping. Unlike traditional RAID-5 or RAID-6, the ZFS equivalents raidz and raidz2 force the drives into lockstep. Effectively you get the random IOPS of only one drive even though you have 14 in the pool.

You should test again to isolate the disk performance. Either run reads on their own (e.g. several of these at the same time: dd if=bigfile of=/dev/null) or try a pure network load test such as iperf.
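A rough sketch of both tests. The file names and sizes are illustrative (it generates small stand-in files so the sketch is self-contained; on the real pool you'd read existing large files), and iperf3 is assumed to be installed on both ends:

```shell
# Disk-only test: read several files in parallel, bypassing Samba and the
# network entirely. Watch zpool iostat while this runs.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/big1" bs=1M count=32 2>/dev/null
dd if=/dev/zero of="$dir/big2" bs=1M count=32 2>/dev/null
dd if="$dir/big1" of=/dev/null bs=1M 2>/dev/null &
dd if="$dir/big2" of=/dev/null bs=1M 2>/dev/null &
wait
echo "parallel reads done"

# Network-only test (run on the real hosts, shown here as comments):
#   server: iperf3 -s
#   client: iperf3 -c 192.168.111.8 -P 4    # four parallel TCP streams
```

If parallel dd reads also top out near 80 MB/s while iperf3 from several clients exceeds 1 Gb/s aggregate, the disks are the bottleneck, not the bond.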

Tom Shaw
  • This is a _terrible_ ZFS zpool setup. Man... – ewwhite Sep 08 '16 at 00:40
  • @ewwhite It's the difference between ZFS hype and reality unfortunately. If you care about performance you need to use mirrored devices or a plain stripe of fault-tolerant devices (i.e. SAN LUNs). In my opinion as soon as Sun developed the separate ZIL they should have provided the option of traditional RAID-5 or RAID-6 along with it, to avoid the lockstep problem. – Tom Shaw Sep 08 '16 at 01:27
  • See edits to the main question. Looks like my bottleneck is the ZFS disk array. I guess my primary question is answered; the bonding is probably performing as it should. Is there anything I can do with my ZFS settings to improve its performance without rebuilding it? Maybe I need to pose a new question to SF. – Ray Collett Sep 08 '16 at 23:12
  • 1
    Normally I would say to increase the recordsize and then re-create the files. This would improve performance in this particular case at the expense of some other cases. However, a quick look at the Ubuntu manpage shows that the maximum recordsize on Ubuntu is 128kB (whereas on Solaris it is 1MB). So I think you have to rebuild. – Tom Shaw Sep 09 '16 at 02:06

Mode 0 (round-robin), Mode 3 (broadcast), Mode 5 (balance-tlb), and Mode 6 (balance-alb) are all terrible bonding modes for TCP streams like Samba, CIFS, NFS, iSCSI, etc., because those modes do not guarantee in-order delivery. TCP relies on in-order delivery; out-of-order segments look like loss to the receiver and trigger TCP's congestion control, which throttles throughput.

For TCP streams like yours, you should be using Mode 1 (active-backup), Mode 2 (balance-xor), or Mode 4 (802.3ad/LACP). Any single stream will be limited to the speed of one slave; there is no good way to balance one large TCP stream across multiple physical interfaces.
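To illustrate why modes 2 and 4 keep each flow ordered: with the default `xmit_hash_policy=layer2`, the driver picks a slave by XOR-ing the source and destination MAC addresses modulo the slave count, so a given client always maps to the same slave. A simplified sketch (the client MACs below are made up; the real kernel hash also folds in the Ethernet protocol type):

```python
# Simplified layer2 transmit hash: XOR the last byte of each MAC,
# then take the result modulo the number of slaves.
def layer2_slave(src_mac: str, dst_mac: str, n_slaves: int) -> int:
    src = int(src_mac.split(":")[-1], 16)
    dst = int(dst_mac.split(":")[-1], 16)
    return (src ^ dst) % n_slaves

# Server bond MAC vs. two hypothetical client MACs: different clients
# can land on different slaves, but each client always gets the same one.
print(layer2_slave("00:09:6b:1a:03:6d", "aa:bb:cc:dd:ee:01", 4))  # → 0
print(layer2_slave("00:09:6b:1a:03:6d", "aa:bb:cc:dd:ee:02", 4))  # → 3
```

So a single client never exceeds one slave's 1 Gb/s, but with enough clients hashing across the four slaves the aggregate can approach 4 Gb/s, which matches what the asker is after.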

If you need faster than one slave, get faster network infrastructure.

suprjami
  • Shouldn't Mode 6 be good enough to get up to 1 Gbps per client system? That is, if you have 4 links in mode 6 bonding, a single client will get up to 1 Gbps link speed, but total speed with lots of clients would be 4 Gbps, assuming enough clients to fill the pipes. Mode 4 is easily the best setup, but it requires special support from the switch. – Mikko Rantalainen May 31 '23 at 07:43
  • 1
    Since I made that comment in 2017, mode 5 and 6 have gained the `tlb_dynamic_lb=0` option, which causes a mode 5 or 6 bond to hash traffic out like modes 2 and 4 do. If you're using the default `tlb_dynamic_lb=1` then traffic is sent out whichever slave has the least load at the time, so ordered delivery is not guaranteed and TCP suffers. – suprjami Jun 01 '23 at 09:28