
Currently I'm trying to set up a Gluster cluster, and the performance is strange; I'm not sure whether I've configured something wrong. I'm using 4x Hetzner root servers running Debian Buster, each with an Intel i7, 128 GB RAM, two NVMe drives and one HDD. Every system has a separate 10 Gbit/s network interface for internal communication (all hosts are directly connected to one switch in one rack).

When I test the network with iperf, I get around 9.41 Gbit/s between all peers.
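
The test was essentially the following (a sketch; the addresses are the internal peer IPs also used for the bricks below):

# on one peer: start the iperf server
iperf -s
# on another peer: run the client against it
iperf -c 10.255.255.2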

I've installed the default Debian glusterfs-server package (glusterfs-server_5.5-3_amd64.deb).
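
Installation and peering were roughly as follows (a sketch; run as root, peers probed from the first node over the internal network):

# install the stock Buster package and start the daemon
apt install glusterfs-server
systemctl enable --now glusterd
# make the other nodes join the trusted pool
gluster peer probe 10.255.255.2
gluster peer probe 10.255.255.3
gluster peer probe 10.255.255.4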

I've built three volumes:

  • SSD (gv0) on /mnt/ssd/gfs/gv0
  • HDD (gv1) on /mnt/hdd/gfs/gv1
  • RAM disk (gv2) on /mnt/ram/gfs/gv2

They were created with:

gluster volume create gv0 replica 2 transport tcp 10.255.255.1:/mnt/ssd/gfs/gv0 10.255.255.2:/mnt/ssd/gfs/gv0 10.255.255.3:/mnt/ssd/gfs/gv0 10.255.255.4:/mnt/ssd/gfs/gv0 force
...

After some configuration changes, all volumes look like this (gv0, gv1 and gv2 are identical):

# gluster volume info gv0
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 0fd68188-2b74-4050-831d-a590ef0faafd
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.255.255.1:/mnt/ssd/gfs/gv0
Brick2: 10.255.255.2:/mnt/ssd/gfs/gv0
Brick3: 10.255.255.3:/mnt/ssd/gfs/gv0
Brick4: 10.255.255.4:/mnt/ssd/gfs/gv0
Options Reconfigured:
performance.flush-behind: on
performance.cache-max-file-size: 512MB
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
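
The reconfigured options were applied with gluster volume set, along these lines (a sketch using the option names from the output above):

gluster volume set gv0 performance.cache-max-file-size 512MB
gluster volume set gv0 performance.flush-behind on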

Later I found some tuning suggestions on the net, but the performance doesn't change much (it is, of course, a single-threaded performance test).

# gluster volume info gv0
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 0fd68188-2b74-4050-831d-a590ef0faafd
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.255.255.1:/mnt/ssd/gfs/gv0
Brick2: 10.255.255.2:/mnt/ssd/gfs/gv0
Brick3: 10.255.255.3:/mnt/ssd/gfs/gv0
Brick4: 10.255.255.4:/mnt/ssd/gfs/gv0
Options Reconfigured:
performance.write-behind-window-size: 1MB
cluster.readdir-optimize: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
performance.io-thread-count: 16
performance.io-cache: on
performance.flush-behind: on
performance.cache-max-file-size: 512MB
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet

I also tried with and without jumbo frames, but that made no difference either:

# ip a s
...
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 6c:b3:11:07:f1:18 brd ff:ff:ff:ff:ff:ff
    inet 10.255.255.2/24 brd 10.255.255.255 scope global enp3s0
       valid_lft forever preferred_lft forever
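
The MTU was changed roughly like this (a sketch; enp3s0 is the internal interface shown above, and the switch must pass 9000-byte frames as well):

# raise the MTU on the internal interface (non-persistent)
ip link set dev enp3s0 mtu 9000
# verify
ip a s enp3s0 | grep mtu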

All three volumes are mounted directly on one of the peers:

10.255.255.1:gv0 /mnt/gv0 glusterfs defaults 0 0
10.255.255.1:gv1 /mnt/gv1 glusterfs defaults 0 0
10.255.255.1:gv2 /mnt/gv2 glusterfs defaults 0 0
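
Mounted manually instead of via fstab, the equivalent for the first entry would be (a sketch):

mount -t glusterfs 10.255.255.1:gv0 /mnt/gv0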

Then I created some test data on a separate RAM disk. I wrote a script that generates many files with dd if=/dev/urandom in a for loop (a sketch of the script follows the summary below). I generated the files up front because /dev/urandom seems to top out at around 45 MB/s when writing to a RAM disk.

----- generate files 10240 x 100K
----- generate files 5120 x 1000K
----- generate files 1024 x 10000K
sum: 16000 MB on /mnt/ram1/
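
The generator script was along these lines (a sketch; the counts and sizes match the summary above, the small/medium/large subdirectories are only illustrative):

#!/bin/bash
# pre-generate random test files on the RAM disk so that the slow
# /dev/urandom throughput does not limit the later copy test
gen() {
    local count=$1 size_kb=$2 dir=$3
    mkdir -p "$dir"
    for i in $(seq 1 "$count"); do
        dd if=/dev/urandom of="$dir/file_${size_kb}k_$i" bs=1K count="$size_kb" status=none
    done
}
gen 10240   100 /mnt/ram1/small    # 10240 x 100K   =  1000 MB
gen  5120  1000 /mnt/ram1/medium   #  5120 x 1000K  =  5000 MB
gen  1024 10000 /mnt/ram1/large    #  1024 x 10000K = 10000 MB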

Now comes the transfer. I simply ran cp -r /mnt/ram1/* /mnt/gv0/ (and the equivalent for the other volumes) to write, cp -r /mnt/gv0/* /mnt/ram1/ to read, and counted the seconds (a timing sketch follows the table). And that looks terrible:

                    read    write
ram <-> ram           4s       4s
ram <-> ssd           4s       7s
ram <-> hdd           4s       7s
ram <-> gv0 (ssd)   162s     145s
ram <-> gv1 (hdd)   164s     165s
ram <-> gv2 (ram)   158s     133s
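
Each cell was measured roughly like this (a sketch; counting seconds by hand and using time give the same picture):

time cp -r /mnt/ram1/* /mnt/gv0/   # write: RAM disk -> Gluster volume
time cp -r /mnt/gv0/* /mnt/ram1/   # read:  Gluster volume -> RAM disk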

So for both read and write, the local disks are roughly 40 times faster than the Gluster volumes. That can't be right.

What am I missing?

  • seen many times but not a single comment or answer; wow – uberrebu Jun 23 '22 at 03:12
  • I believe there is no solution. The problem is always the round-trip time of IP packets between all the nodes. I've tested several distributed file systems, and they all have that problem. The performance is much better when you have sequential operations (a few very big files instead of many small files), but that is often not the reality. For me, I switched to a single NFS instance with ZFS snapshot sync to a second node, which is not failsafe but fast enough, because the NFS instance doesn't need to wait for an ack from other nodes. – TRW Jun 24 '22 at 10:13
  • I will stick to local EXT storage for added thin-provisioning support... "simple" shared storage is rocket science – uberrebu Jun 24 '22 at 10:16
  • What about using RDMA to communicate? Does GlusterFS support it and should it reduce latencies? – Nikita Kipriyanov Jun 24 '22 at 10:19
  • The common advice is to use as many hard disks as possible to get high bandwidth, but with a RAM disk there is no better performance, so it comes down to network speed. Of course 10 Gbit/s is not the best possible solution, but... the problem is not the amount of bytes to transfer, it is the time from "here you have a bit" until the moment you know "most of my colleagues have ACKed", which means the file is stored. The fun part is why reading is so slow: because we need to ask the cluster who has that file, and that also takes a lot of time... So I understand the problem. – TRW Jun 24 '22 at 10:20
  • Yes, Gluster supports RDMA, but it was removed after v8 – see https://docs.gluster.org/en/main/Administrator-Guide/RDMA-Transport/ – TRW Jun 24 '22 at 10:23

0 Answers