
I have four disks available on my virtual machine for testing: sdb, sdc, sdd, and sde.

The first three disks are used for a RAID5 configuration; the last disk is used as an LVM cache drive.
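
For context, this is roughly how such a layout gets created (an illustrative sketch only; device names, sizes, and options are placeholders rather than my exact commands):

# Illustrative sketch: device names and sizes are placeholders
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate data /dev/sdb /dev/sdc /dev/sdd

# RAID5 over the first three PVs: 2 data stripes + parity, 64 KiB stripe size
lvcreate --type raid5 --stripes 2 --stripesize 64 -l 100%FREE -n data data

# Add the fourth disk and turn it into a 50 GiB cache pool for the RAID5 LV
vgextend data /dev/sde
lvcreate --type cache-pool -L 50G -n cache_data data /dev/sde
lvconvert --type cache --cachepool data/cache_data data/data

mkfs.xfs /dev/data/data
mount /dev/data/data /data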

What I don't understand is the following:

When I create a cache disk of 50GB with a chunk size of 64KiB, xfs_info gives me the following:

[vagrant@node-02 ~]$ xfs_info /data
meta-data=/dev/mapper/data-data isize=512    agcount=32, agsize=16777072 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=536866304, imaxpct=5
         =                       sunit=16     swidth=32 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=262144, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

As we can see here, sunit=16 and swidth=32 seem to be correct and match the RAID5 layout.
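
A quick sanity check of those numbers (sunit/swidth are reported in filesystem blocks, with bsize=4096 here):

# sunit  = 16 blocks * 4 KiB = 64 KiB   -> one RAID5 chunk
# swidth = 32 blocks * 4 KiB = 128 KiB  -> 2 data disks * 64 KiB (3-disk RAID5)
xfs_info /data | grep -E 'sunit|swidth'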

The output of lsblk -t:

[vagrant@node-02 ~]$ lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                                  0    512      0     512     512    1 deadline     128 4096    0B
├─sda1                               0    512      0     512     512    1 deadline     128 4096    0B
└─sda2                               0    512      0     512     512    1 deadline     128 4096    0B
  ├─centos-root                      0    512      0     512     512    1              128 4096    0B
  ├─centos-swap                      0    512      0     512     512    1              128 4096    0B
  └─centos-home                      0    512      0     512     512    1              128 4096    0B
sdb                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_0          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_0         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
sdc                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_1          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_1         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
sdd                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_2          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_2         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
sde                                  0    512      0     512     512    1 deadline     128 4096   32M
sdf                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-cache_data5_cdata            0    512      0     512     512    1              128 4096   32M
│ └─data5-data5                      0  65536 131072     512     512    1              128 4096    0B
└─data5-cache_data5_cmeta            0    512      0     512     512    1              128 4096   32M
  └─data5-data5                      0  65536 131072     512     512    1              128 4096    0B
sdg                                  0    512      0     512     512    1 deadline     128 4096   32M
sdh                                  0    512      0     512     512    1 deadline     128 4096   32M

And lvdisplay -a -m data gives me the following:

[vagrant@node-02 ~]$ sudo lvdisplay -m -a data
  --- Logical volume ---
  LV Path                /dev/data/data
  LV Name                data
  VG Name                data
  LV UUID                MBG1p8-beQj-TNDd-Cyx4-QkyN-vdVk-dG6n6I
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:08 +0000
  LV Cache pool name     cache_data
  LV Cache origin name   data_corig
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Cache used blocks      0.06%
  Cache metadata blocks  0.64%
  Cache dirty blocks     0.00%
  Cache read hits/misses 293 / 66
  Cache wrt hits/misses  59 / 41173
  Cache demotions        0
  Cache promotions       486
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:9

  --- Segments ---
  Logical extents 0 to 524283:
    Type                cache
    Chunk size          64.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data
  VG Name                data
  LV UUID                apACl6-DtfZ-TURM-vxjD-UhxF-tthY-uSYRGq
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:16 +0000
  LV Pool metadata       cache_data_cmeta
  LV Pool data           cache_data_cdata
  LV Status              NOT available
  LV Size                50.00 GiB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Logical extents 0 to 12799:
    Type                cache-pool
    Chunk size          64.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data_cmeta
  VG Name                data
  LV UUID                hmkW6M-CKGO-CTUP-rR4v-KnWn-DbBZ-pJeEA2
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:15 +0000
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:11

  --- Segments ---
  Logical extents 0 to 255:
    Type                linear
    Physical volume     /dev/sdf
    Physical extents    0 to 255


  --- Logical volume ---
  Internal LV Name       cache_data_cdata
  VG Name                data
  LV UUID                9mHe8J-SRiY-l1gl-TO1h-2uCC-Hi10-UpeEVP
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:16 +0000
  LV Status              available
  # open                 1
  LV Size                50.00 GiB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:10

  --- Segments ---
  Logical extents 0 to 12799:
    Type                linear
    Physical volume     /dev/sdf
    Physical extents    256 to 13055

  --- Logical volume ---
  Internal LV Name       data_corig
  VG Name                data
  LV UUID                QP8ppy-nv1v-0sii-tANA-6ZzK-EJkP-sLfrh4
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:17 +0000
  LV origin of Cache LV  data
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           253:12

  --- Segments ---
  Logical extents 0 to 524283:
    Type                raid5
    Monitoring          monitored
    Raid Data LV 0
      Logical volume    data_corig_rimage_0
      Logical extents   0 to 262141
    Raid Data LV 1
      Logical volume    data_corig_rimage_1
      Logical extents   0 to 262141
    Raid Data LV 2
      Logical volume    data_corig_rimage_2
      Logical extents   0 to 262141
    Raid Metadata LV 0  data_corig_rmeta_0
    Raid Metadata LV 1  data_corig_rmeta_1
    Raid Metadata LV 2  data_corig_rmeta_2


  --- Logical volume ---
  Internal LV Name       data_corig_rimage_2
  VG Name                data
  LV UUID                Df7SLj
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:08 +0000
  LV Status              available
  # open                 1
  LV Size                1023.99 GiB
  Current LE             262142
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:8

  --- Segments ---
  Logical extents 0 to 262141:
    Type                linear
    Physical volume     /dev/sdd
    Physical extents    1 to 262142


  --- Logical volume ---
  Internal LV Name       data_corig_rmeta_2
  VG Name                data
  LV UUID                xi9Ot3-aTnp-bA3z-YL0x-eVaB-87EP-JSM3eN
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:08 +0000
  LV Status              available
  # open                 1
  LV Size                4.00 MiB
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:7

  --- Segments ---
  Logical extents 0 to 0:
    Type                linear
    Physical volume     /dev/sdd
    Physical extents    0 to 0


We can clearly see the chunk size of 64KiB in the segments.

But when I create a cache disk of 250 GB, LVM needs a chunk size of at least 288 KiB for the cache pool to accommodate that size. And now when I execute xfs_info, the sunit/swidth values suddenly match those of the cache drive instead of the RAID5 layout.
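
As far as I can tell, the 288 KiB minimum comes from LVM capping the number of chunks in a cache pool; the roughly 1,000,000 chunk limit and the 32 KiB chunk granularity below are my assumptions:

# 250 GiB = 262,144,000 KiB
#   262,144,000 KiB / 256 KiB = 1,024,000 chunks  -> over ~1,000,000
#   262,144,000 KiB / 288 KiB ≈   910,223 chunks  -> fits (288 KiB = next 32 KiB multiple)
sudo lvs -a -o lv_name,lv_size,chunk_size data   # shows the chunk size LVM picked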

Output of xfs_info:

[vagrant@node-02 ~]$ xfs_info /data
meta-data=/dev/mapper/data-data isize=512    agcount=32, agsize=16777152 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=536866816, imaxpct=5
         =                       sunit=72     swidth=72 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=262144, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Suddenly we have a sunit and swidth of 72, which matches the 288 KiB chunk size of the cache drive (72 blocks × 4 KiB = 288 KiB). We can see this with lvdisplay -m -a:

[vagrant@node-02 ~]$ sudo lvdisplay -m -a data
  --- Logical volume ---
  LV Path                /dev/data/data
  LV Name                data
  VG Name                data
  LV UUID                XLHw3w-RkG9-UNh6-WZBM-HtjM-KcV6-6dOdnG
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:32 +0000
  LV Cache pool name     cache_data
  LV Cache origin name   data_corig
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Cache used blocks      0.17%
  Cache metadata blocks  0.71%
  Cache dirty blocks     0.00%
  Cache read hits/misses 202 / 59
  Cache wrt hits/misses  8939 / 34110
  Cache demotions        0
  Cache promotions       1526
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:9

  --- Segments ---
  Logical extents 0 to 524283:
    Type                cache
    Chunk size          288.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data
  VG Name                data
  LV UUID                Ps7Z1P-y5Ae-ju80-SZjc-yB6S-YBtx-SWL9vO
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:40 +0000
  LV Pool metadata       cache_data_cmeta
  LV Pool data           cache_data_cdata
  LV Status              NOT available
  LV Size                250.00 GiB
  Current LE             64000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Logical extents 0 to 63999:
    Type                cache-pool
    Chunk size          288.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data_cmeta
  VG Name                data
  LV UUID                k4rVn9-lPJm-2Vvt-77jw-NP1K-PTOs-zFy2ph
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:39 +0000
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:11

  --- Segments ---
  Logical extents 0 to 255:
    Type                linear
    Physical volume     /dev/sdf
    Physical extents    0 to 255


  --- Logical volume ---
  Internal LV Name       cache_data_cdata
  VG Name                data
  LV UUID                dm571W-f9eX-aFMA-SrPC-PYdd-zs45-ypLksd
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:39 +0000
  LV Status              available
  # open                 1
  LV Size                250.00 GiB
  Current LE             64000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:10

  --- Logical volume ---
  Internal LV Name       data_corig
  VG Name                data
  LV UUID                hbYiRO-YnV8-gd1B-shQD-N3SR-xpTl-rOjX8V
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:41 +0000
  LV origin of Cache LV  data
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           253:12

  --- Segments ---
  Logical extents 0 to 524283:
    Type                raid5
    Monitoring          monitored
    Raid Data LV 0
      Logical volume    data_corig_rimage_0
      Logical extents   0 to 262141
    Raid Data LV 1
      Logical volume    data_corig_rimage_1
      Logical extents   0 to 262141
    Raid Data LV 2
      Logical volume    data_corig_rimage_2
      Logical extents   0 to 262141
    Raid Metadata LV 0  data_corig_rmeta_0
    Raid Metadata LV 1  data_corig_rmeta_1
    Raid Metadata LV 2  data_corig_rmeta_2

And the output of lsblk -t

[vagrant@node-02 ~]$ lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                                  0    512      0     512     512    1 deadline     128 4096    0B
├─sda1                               0    512      0     512     512    1 deadline     128 4096    0B
└─sda2                               0    512      0     512     512    1 deadline     128 4096    0B
  ├─centos-root                      0    512      0     512     512    1              128 4096    0B
  ├─centos-swap                      0    512      0     512     512    1              128 4096    0B
  └─centos-home                      0    512      0     512     512    1              128 4096    0B
sdb                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_0          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_0         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdc                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_1          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_1         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdd                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_2          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_2         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sde                                  0    512      0     512     512    1 deadline     128 4096   32M
sdf                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-cache_data5_cdata            0    512      0     512     512    1              128 4096   32M
│ └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
└─data5-cache_data5_cmeta            0    512      0     512     512    1              128 4096   32M
  └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
sdg                                  0    512      0     512     512    1 deadline     128 4096   32M
sdh                                  0    512      0     512     512    1 deadline     128 4096   32M

A few questions arise here.

XFS apparently autodetects these settings, but why does XFS choose to use the chunk size of the cache drive? It is able to autodetect the RAID5 layout, as we could see in the first example.

I know that I can pass the su/sw options to mkfs.xfs to get the correct sunit/swidth values, but should I do that in this case?

http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance
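
For example, something along these lines would force the RAID5 geometry instead of the autodetected one (whether that is the right thing to do here is exactly my question):

# Force stripe unit = 64 KiB (RAID5 chunk) and stripe width = 2 data disks
mkfs.xfs -f -d su=64k,sw=2 /dev/data/data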

I have been googling for days now and looked in the XFS source code, but I wasn't able to find any clue as to why XFS does this.

So the questions that arise are:

  • Why does XFS behave like this?
  • Should I define su/sw manually when running mkfs.xfs?
  • Does the chunk size of the cache drive have an influence on the RAID5 setup, and should this be aligned somehow?

1 Answer


Determining the optimal allocation policy is a complex problem, as it depends on how the various block layers interact with each other.

In determining the optimal allocation policy, mkfs.xfs uses the information provided by libblkid. You can access the same information by issuing lsblk -t. It is very probable that mkfs.xfs uses the 288K allocation alignment because LVM (well, device-mapper actually) simply passes that value up the stack.
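
For example, you can query exactly the values libblkid sees (the device path below is taken from your lsblk output):

# Minimum and optimal I/O sizes advertised by the block layer for the cached LV
sudo blockdev --getiomin --getioopt /dev/mapper/data5-data5

# The same values via sysfs
DM=$(basename "$(readlink -f /dev/mapper/data5-data5)")
cat /sys/block/$DM/queue/minimum_io_size /sys/block/$DM/queue/optimal_io_size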

I saw very similar behavior with thin provisioning, where mkfs.xfs aligns the filesystem on the thin chunk size.

EDIT: so, this is the output of lsblk -t...

[vagrant@node-02 ~]$ lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                                  0    512      0     512     512    1 deadline     128 4096    0B
├─sda1                               0    512      0     512     512    1 deadline     128 4096    0B
└─sda2                               0    512      0     512     512    1 deadline     128 4096    0B
  ├─centos-root                      0    512      0     512     512    1              128 4096    0B
  ├─centos-swap                      0    512      0     512     512    1              128 4096    0B
  └─centos-home                      0    512      0     512     512    1              128 4096    0B
sdb                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_0          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_0         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdc                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_1          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_1         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdd                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_2          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_2         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sde                                  0    512      0     512     512    1 deadline     128 4096   32M
sdf                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-cache_data5_cdata            0    512      0     512     512    1              128 4096   32M
│ └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
└─data5-cache_data5_cmeta            0    512      0     512     512    1              128 4096   32M
  └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
sdg                                  0    512      0     512     512    1 deadline     128 4096   32M
sdh                                  0    512      0     512     512    1 deadline     128 4096   32M

As you can see, the data5-data5 device (on top of which you create the XFS filesystem) reports MIN-IO and OPT-IO of 294912 bytes (288K, your cache chunk size), while the underlying devices report the RAID array chunk size (64K). This means that device-mapper overrode the underlying I/O information with the current cache chunk size.
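
You can also see where that value comes from by dumping the device-mapper table of the cached LV; the dm-cache target reports its block (chunk) size in 512-byte sectors:

# 576 sectors * 512 B = 288 KiB, the value that ends up as MIN-IO/OPT-IO
sudo dmsetup table data5-data5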

mkfs.xfs simply uses what libblkid reports, which, in turn, depends on the specific device-mapper cache target being used.

shodanshok
  • I updated the question with the output of `lsblk -t`, I can't see any difference in the output between the 2 situations. Also it appears `device-mapper` passes the correct values when the cache chunksize is small, thus somewhere `device-mapper` or `xfs` decides to use the cache chunksize but I can't find anything about this in the docs or even source code. btw: I like complex problems ;) – Sander Visser Sep 03 '19 at 15:28
  • Your `lsblk -t` shows the root issue very well. I'll update my answer. – shodanshok Sep 03 '19 at 16:14
  • Aaah yes, I see. Then one last question: why is MIN-IO / OPT-IO the same as the underlying RAID layout with a smaller chunksize? And is it better to specify `sw/su` in `mkfs.xfs`, or should I use the values of the cache drive? – Sander Visser Sep 03 '19 at 16:46
  • In the end we use hardware RAID, so we need to specify `su/sw` anyway, but which values should we use? – Sander Visser Sep 03 '19 at 16:53
  • @SanderVisser this is a different problem, please open a new question. – shodanshok Sep 03 '19 at 22:20