
I'm running into an issue where one of my shards is constantly at 100% CPU usage while I'm storing files into my MongoDB (using GridFS). I have shut down writing to the DB, and the usage drops to nearly 0%. However, the auto balancer is on and does not appear to be balancing anything. Roughly 50% of my data sits on that one shard with nearly 100% CPU usage, while virtually all the others are at 7-8%.

Any ideas?

mongos> version()
3.0.6

Auto Balancing Enabled

Storage Engine: WiredTiger

I have this general architecture:
2 - routers
3 - config servers
8 - shards (2 shards per server - 4 servers)
No replica sets!

https://docs.mongodb.org/v3.0/core/sharded-cluster-architectures-production/
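For reference, balancer state can be verified from the mongos shell with the standard helpers (a minimal sketch):

mongos> sh.getBalancerState()
true
mongos> sh.isBalancerRunning()
false

getBalancerState reports whether the balancer is enabled; isBalancerRunning reports whether a balancing round is actually in progress at that moment.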

Log Details

Router 1 Log:

2016-01-15T16:15:21.714-0700 I NETWORK  [conn3925104] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:23.256-0700 I NETWORK  [LockPinger] Socket recv() timeout  [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK  [LockPinger] SocketException: remote: [IP]:[port] error: 9001 socket exception [RECV_TIMEOUT] server [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK  [LockPinger] DBClientCursor::init call() failed
2016-01-15T16:15:23.256-0700 I NETWORK  [LockPinger] scoped connection to [IP]:[port],[IP]:[port],[IP]:[port] not being returned to the pool
2016-01-15T16:15:23.256-0700 W SHARDING [LockPinger] distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579303:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed:  [IP]:[port] (IP) failed:10276 DBClientBase::findN: transport error: [IP]:[port] ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2016-01-15T16:15:24.715-0700 I NETWORK  [mongosMain] connection accepted from [IP]:[port] #3925105 (64 connections now open)
2016-01-15T16:15:24.715-0700 I NETWORK  [conn3925105] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:27.717-0700 I NETWORK  [mongosMain] connection accepted from [IP]:[port] #3925106 (64 connections now open)
2016-01-15T16:15:27.718-0700 I NETWORK  [conn3925106] end connection [IP]:[port] (63 connections now open)

Router 2 Log:

2016-01-15T16:18:21.762-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e3d110ccb8e38549a9d
2016-01-15T16:18:24.316-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:24 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:18:24.978-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:35.295-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e4a110ccb8e38549a9f
2016-01-15T16:18:38.507-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:48.838-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e58110ccb8e38549aa1
2016-01-15T16:18:52.038-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:54.660-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:54 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:19:02.323-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e66110ccb8e38549aa3
2016-01-15T16:19:05.513-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.

Problematic Shard Log:

2016-01-15T16:21:03.426-0700 W SHARDING [conn40] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 46 took 17364ms
2016-01-15T16:21:03.484-0700 I COMMAND  [conn40] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 17421ms
2016-01-15T16:21:03.775-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:21:03 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1441718306:765353801', sleeping for 30000ms
2016-01-15T16:21:04.321-0700 I SHARDING [conn40] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:08.243-0700 I SHARDING [conn46] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:10.174-0700 W SHARDING [conn37] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 60 took 18516ms
2016-01-15T16:21:10.232-0700 I COMMAND  [conn37] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18574ms
2016-01-15T16:21:10.989-0700 W SHARDING [conn25] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 62 took 18187ms
2016-01-15T16:21:11.047-0700 I COMMAND  [conn25] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18246ms
2016-01-15T16:21:11.365-0700 I SHARDING [conn37] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
  • Please describe how you connect to your `mongos` and how your shards are defined. BTW, if you read carefully, MongoDB says: `In a production cluster, you must ensure that data is redundant and that your systems are highly available.` So that is not a `balancer`; it only provides `high availability`. – Valijon Jan 15 '16 at 23:50
  • How do I connect? For writing to Mongo, the C# API. ---- How are my shards defined? I'm not entirely sure what you mean. I have eight 10TB shards. I use the Files database, which automatically creates the fs.files & fs.chunks collections - are you interested in shard key details? ----- My Mongo cluster is fully virtualized, and the virtualized environment resides on a RAIDed storage disk array. If one fails, I can recover, so replica sets are totally redundant for my circumstances. – afghanbloodhound Jan 16 '16 at 00:03
  • Yes, I mean shard keys. If the shard key(s) are not defined correctly, MongoDB will not distribute files across all shards. About the connection, I mean: did you use a MongoDB URI connection? – Valijon Jan 16 '16 at 00:13
  • From what I can see, you [chose the wrong shard key](https://docs.mongodb.org/manual/tutorial/choose-a-shard-key/), causing the majority of data to be written to one shard. If you have several TB of data: Get yourself a DBA, because that's going to take a while to get right. Basically, you need to dump the data, delete the old collection, set a new one up with a proper shard key and restore the data to the new collection. There are some nuts and bolts to that procedure, however. – Markus W Mahlberg Jan 16 '16 at 11:55
  • 1
    please upgrade ASAP to 3.0.8 as there is a wiredTiger bug that was fixed that could impact migrations. – Asya Kamsky Jan 17 '16 at 01:09
  • shard key: { "files_id" : 1, "n" : 1 } – afghanbloodhound Jan 18 '16 at 16:11
  • where files_id is auto-generated - regardless of what your shard key is, shouldn't the auto balancer redistribute the data? Isn't that the whole purpose of auto balancing? It appears that balancing is just not happening. – afghanbloodhound Jan 18 '16 at 16:33
  • Mongo URI connection - yes, I do use that format, but only one of the routers was specified. I need to change that to specify both; I'm not sure of the exact syntax. I know I need the router IP:port plus some sort of options tag? (See the connection-string sketch after these comments.) – afghanbloodhound Jan 18 '16 at 16:50
  • I upgraded to Mongo v3.2.x and the splitting error I was getting has been resolved. I still have the uneven distribution. I find it hard to believe it is the shard key, since I am using the md5 checksum of the file I'm submitting to the DB. I'm also using range-based partitioning; I'm thinking I should move to hash-based. – afghanbloodhound Apr 18 '16 at 17:21
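On the URI syntax question from the comments: the standard MongoDB connection string accepts a comma-separated list of mongos routers, with options after a `?` (a sketch; hostnames, ports, and option values here are placeholders):

mongodb://router1.example.com:27017,router2.example.com:27017/Files?connectTimeoutMS=10000

With the C# driver this string is passed straight to the client, e.g. new MongoClient("mongodb://router1.example.com:27017,router2.example.com:27017"), and the driver can then fail over between the listed routers.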

2 Answers


For the splitting error - upgrading to MongoDB v3.0.8+ resolved it.

Still having an issue with the balancing itself. The shard key is an md5 checksum, so unless the files all have very similar md5s (not very likely), there is still investigating to do. Still using range-based partitioning; a move to hashed sharding is sketched below.
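If moving to hash-based partitioning, the GridFS chunks collection can be sharded on a hashed `files_id`, per the MongoDB GridFS sharding docs (a sketch; note that an existing collection's shard key cannot be changed in place, so this applies to a rebuilt collection):

mongos> db.fs.chunks.createIndex( { files_id : "hashed" } )
mongos> sh.shardCollection( "Files.fs.chunks", { files_id : "hashed" } )

A hashed key randomizes chunk placement at insert time, so clustered or similar key values no longer land on the same shard's range.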

– afghanbloodhound

There are multiple ways to check (a combined example follows this list):

  1. db.printShardingStatus() - this lists all sharded collections, whether the auto balancer is on, and which collection is currently being balanced and since when.

  2. sh.status(true) - this gives chunk-level details. Check whether any of your chunks are marked jumbo: true; a chunk marked as jumbo will not be split or migrated properly.

  3. db.collection.stats() - this gives collection stats, including the data distribution on each shard.
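A combined sketch of those checks from the mongos shell (collection name taken from the logs above):

mongos> db.printShardingStatus()   // sharded collections + balancer state
mongos> sh.status(true)            // verbose output; look for "jumbo" on chunks
mongos> use Files
mongos> db.fs.chunks.stats()       // per-shard counts/sizes under the "shards" field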

– Mahesh Malpani