
I've installed a completely fresh OS with Proxmox on 4 nodes. Every node has 2× NVMe and 1× HDD, one public NIC and one private NIC. On the public network there is an additional WireGuard interface running for PVE cluster communication. The private interface should be used only for the upcoming distributed storage.

# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 6c:b3:11:07:f1:18 brd ff:ff:ff:ff:ff:ff
    inet 10.255.255.2/24 brd 10.255.255.255 scope global enp3s0
       valid_lft forever preferred_lft forever
    inet6 fe80::6eb3:11ff:fe07:f118/64 scope link 
       valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b4:2e:... brd ff:ff:ff:ff:ff:ff
    inet 168..../26 brd 168....127 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 2a01:.../128 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::b62e:99ff:fecc:f5d0/64 scope link 
       valid_lft forever preferred_lft forever
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether a2:fd:6a:c7:f0:be brd ff:ff:ff:ff:ff:ff
    inet6 2a01:....::2/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::..:f0be/64 scope link 
       valid_lft forever preferred_lft forever
6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.3.0.10/32 scope global wg0
       valid_lft forever preferred_lft forever
    inet6 fd01:3::a/128 scope global 
       valid_lft forever preferred_lft forever
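
For completeness, the private NIC is configured statically via ifupdown; the stanza in /etc/network/interfaces looks roughly like this (a sketch, not the literal file):

auto enp3s0
iface enp3s0 inet static
        address 10.255.255.2/24
        mtu 9000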

The nodes are fine and the PVE cluster is running as expected.

# pvecm status
Cluster information
-------------------
Name:             ac-c01
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 15 22:36:44 2020
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          1.11
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.3.0.4
0x00000002          1 10.3.0.10 (local)
0x00000003          1 10.3.0.13
0x00000004          1 10.3.0.16

The PVE firewall is active in the cluster, but there is a rule that lets all PVE nodes talk to each other on any protocol, any port, and any interface. This works: I can ping, ssh, etc. between all nodes on all IPs.
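
The rule boils down to something like this in /etc/pve/firewall/cluster.fw (a sketch; the IPSet name is my own choice, and in reality the set also lists each node's 10.255.255.0/24 and public addresses):

[IPSET pve-nodes]
10.3.0.4
10.3.0.10
10.3.0.13
10.3.0.16

[RULES]
IN ACCEPT -source +pve-nodes
OUT ACCEPT -dest +pve-nodes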

Then I installed Ceph:

pveceph install

On the first node I initialized Ceph with

pveceph init -network 10.255.255.0/24
pveceph createmon

That works.
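
As far as I can tell, pveceph init mainly writes /etc/pve/ceph.conf; after the call above, the relevant excerpt on my node looks like this (exact key names may differ between PVE/Ceph versions):

[global]
     cluster_network = 10.255.255.0/24
     public_network = 10.255.255.0/24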

On the second node I tried the same (I'm not sure whether I need to set the -network option, so I tried it with and without). That works too.

But pveceph createmon fails on every node except the first with:

# pveceph createmon
got timeout

I can also reach 10.255.255.1:6789 from any node. Whatever I try, I get "got timeout" on every node except node1. Disabling the firewall doesn't have any effect either.
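
Those reachability checks were nothing special, e.g.:

ping 10.255.255.1
nc -vz 10.255.255.1 6789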

When I remove the -network option, I can run all commands. It looks like Ceph cannot talk via the second interface. But the interface is fine.

When I set network to 10.3.0.0/24 and cluster-network to 10.255.255.0/24, it works too, but I want all Ceph communication running via 10.255.255.0/24. What is wrong?
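
For reference, the working split-network variant was initialized roughly like this:

pveceph init --network 10.3.0.0/24 --cluster-network 10.255.255.0/24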

TRW
  • Were you able to fix your issue? – Danyright Jun 12 '23 at 09:15
  • Yes and no. Changing the MTU size would help here, but that is not an option for my environment, and I'm still irritated that even a simple monitor fails here and that the MTU must be set consistently on all network interfaces too. That sounds more like "don't change a running system, because we don't know what happens inside". I've checked GlusterFS (which works well in a hybrid environment, but is slow too). At the moment we use NFS with ZFS syncoid. – TRW Jun 13 '23 at 11:25

2 Answers


Just for reference, the official documentation mentions jumbo frames as bringing important performance improvements:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/configuration_guide/ceph-network-configuration#verifying-and-configuring-the-mtu-value_conf

https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/

I, for one, have seen read/write performance improvements of around 1400% after changing the MTU on the 6 nodes we set up (3 storage, 3 compute).

And no, this is not a typo. We went from 110 MB/s read/write with dd tests in Linux VMs to 1.5-1.6 GB/s afterwards (1 Gbps public network, 10 Gbps private network, OSDs on SATA SSDs).
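
The dd tests were simple sequential runs inside the VMs, along these lines (path and sizes are just examples; oflag=direct/iflag=direct bypass the page cache so the storage path is actually measured):

dd if=/dev/zero of=/root/ddtest bs=1M count=4096 oflag=direct
dd if=/root/ddtest of=/dev/null bs=1M iflag=direct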

Nota Bene: changing the MTU on all network interfaces (public AND private) seems quite important! In our case, changing it only on the private NICs made the whole system go haywire.

From Red Hat's documentation:

Important

Red Hat Ceph Storage requires the same MTU value throughout all networking devices in the communication path, end-to-end for both public and cluster networks.
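
A quick way to verify that requirement is a non-fragmenting ping with a jumbo-sized payload: 8972 bytes of payload plus 28 bytes of IPv4/ICMP headers is exactly 9000, so the ping fails if any hop in the path has a smaller MTU (destination address taken from the question):

ping -M do -s 8972 10.255.255.1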

I hope this helps someone! Cheers

Danyright

The problem is the MTU of 9000. Even when I run the complete Proxmox cluster via the private network, there are errors. Dropping the MTU back to 1500 fixes it:

ip link set enp3s0 mtu 1500

So, in this setup, Ceph has a problem with jumbo frames.
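
Note that ip link only changes the MTU until the next reboot; to make it permanent with Proxmox's ifupdown config, set it in the interface stanza in /etc/network/interfaces, e.g.:

auto enp3s0
iface enp3s0 inet static
        address 10.255.255.2/24
        mtu 1500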

TRW
  • Actually, that's not correct: Ceph works much better with jumbo frames; it's the recommended configuration. But I'm not familiar with Proxmox, so I can't really help. – eblock Jan 29 '21 at 08:22
  • Yes, you're right, I would prefer jumbo frames too. But I'm not sure whether it is a Proxmox problem or a problem with the provider (Hetzner) and the internal 10G connection. I don't know what hardware they use. But I have the same issues with Gluster too (same network). So I think this is not a Ceph or Proxmox issue but Hetzner-specific. – TRW Jan 30 '21 at 13:54