I am in the process of upgrading storage for our MogileFS cluster and am using the rebalance and device drain features to migrate data from one set of devices to another. We have about 55 TB on one set of devices that I would like to migrate to a new set of devices with 88 TB free.
I have the following policy setup:
[ashinn@mogile2 ~]$ sudo mogadm rebalance settings
rebal_policy = from_devices=2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025,2026,2027,2028 to_hosts=5,6,7 leave_in_drain_mode=1
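For reference, the policy was set up roughly like the following; the option names are reconstructed from the keys in the policy string above, so treat this as a sketch rather than the exact invocation and double-check against mogadm rebalance --help on your version:
# Define the rebalance policy and kick it off (option names inferred from
# the policy-string keys; verify them for your mogadm version).
sudo mogadm rebalance policy \
    --from_devices=2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025,2026,2027,2028 \
    --to_hosts=5,6,7 \
    --leave_in_drain_mode
sudo mogadm rebalance start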
But it appears to only drain/rebalance one device at a time:
[ashinn@mogile2 ~]$ sudo mogadm rebalance status
Rebalance is running
Rebalance status:
bytes_queued = 250755303323
completed_devs =
fids_queued = 7785000
limit = 0
sdev_current = 2005
sdev_lastfid = 1444986524
sdev_limit = none
source_devs = 2016,2028,2007,2013,2012,2022,2008,2001,2024,2017,2023,2025,2009,2015,2006,2026,2021,2020,2019,2010,2027,2004,2018,2014,2002,2011,2003
time_finished = 0
time_started = 1340960590
time_stopped = 0
At this rate, it would take 4 months to drain all the old devices and rebalance onto the new ones! Here is a list of the devices I am trying to drain along with the new ones that were added. dev2001 through dev2028 are set to drain and rebalance to all 3 hosts (including the new devices dev2029 through dev2036 on host id 6):
[ashinn@mogile2 ~]$ sudo mogadm device list | grep dev20
dev2001: drain 2018.942 731.216 2750.158
dev2002: drain 2022.452 727.706 2750.158
dev2003: drain 2022.311 727.848 2750.158
dev2004: drain 2022.211 727.947 2750.158
dev2005: drain 1472.550 1277.608 2750.158
dev2006: drain 2022.135 728.023 2750.158
dev2007: drain 2022.139 728.020 2750.158
dev2008: drain 2022.246 727.912 2750.158
dev2009: drain 2022.369 727.789 2750.158
dev2010: drain 2022.191 727.967 2750.158
dev2011: drain 2022.694 727.464 2750.158
dev2012: drain 2022.256 727.902 2750.158
dev2013: drain 2022.117 728.041 2750.158
dev2014: drain 2022.271 727.887 2750.158
dev2015: drain 2021.590 728.568 2750.158
dev2016: drain 2021.499 728.659 2750.158
dev2017: drain 2021.712 728.446 2750.158
dev2018: drain 2021.191 728.967 2750.158
dev2019: drain 2020.846 729.312 2750.158
dev2020: drain 2021.758 728.400 2750.158
dev2021: drain 2021.490 728.668 2750.158
dev2022: drain 2021.217 728.941 2750.158
dev2023: drain 2020.922 729.236 2750.158
dev2024: drain 2019.909 730.249 2750.158
dev2025: drain 2020.503 729.655 2750.158
dev2026: drain 2020.807 729.352 2750.158
dev2027: drain 2021.056 729.103 2750.158
dev2028: drain 2020.487 729.671 2750.158
dev2029: alive 182.120 10818.996 11001.116
dev2030: alive 184.549 10816.567 11001.116
dev2031: alive 185.268 10815.849 11001.116
dev2032: alive 182.004 10819.112 11001.116
dev2033: alive 189.295 10811.821 11001.116
dev2034: alive 183.199 10817.917 11001.116
dev2035: alive 178.625 10822.491 11001.116
dev2036: alive 180.549 10820.567 11001.116
We have already tried tuning queue_rate_for_rebal, queue_size_for_rebal, and the number of replicate workers.
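The tuning was along these lines; the values shown here are placeholders rather than the numbers we actually settled on:
# Raise how fast the rebalance job queues fids and how deep the queue may
# grow (placeholder values, not our real ones).
sudo mogadm settings set queue_rate_for_rebal 1000
sudo mogadm settings set queue_size_for_rebal 50000
# Ask the tracker for more replicate workers via its admin interface
# (default port 7001); the worker count is likewise just an example.
echo "!want 16 replicate" | nc localhost 7001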
We did this once using zones across two data centers, and the replication was MUCH faster. We were hoping a rebalance would behave much like replication, but at this rate it seems like marking the old devices as dead to force re-replication of the fids would be faster.
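For comparison, the "mark dead" approach we would rather avoid looks something like this (the hostname is just a placeholder, not one of our real hosts):
# Marking a drained device dead makes the replicator re-create its fids
# elsewhere from the remaining copies. "oldstore1" is a placeholder hostname.
sudo mogadm device mark oldstore1 2001 dead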
Are there any other ways to speed up a rebalance (such as draining multiple devices at once) without having to mark devices as dead?