I'm running into an issue where one of my shards is constantly at 100% CPU usage while I'm storing files into my Mongo DB (using Grid FS). I have shutdown writing to the DB and the usage does drop down to nearly 0%. However, the auto balancer is on and does not appear to be auto balancing anything. I have roughly 50% of my data on that one shard with nearly 100% CPU usage and virtually all the others are at 7-8%.
Any ideas?
mongos> version()
3.0.6
Auto Balancing Enabled
Storage Engine: WiredTiger
I have this general architecture:
2 - routers
3 - config server
8 - shards (2 shards per server - 4 servers)
No replica sets!
https://docs.mongodb.org/v3.0/core/sharded-cluster-architectures-production/
Log Details
Router 1 Log:
2016-01-15T16:15:21.714-0700 I NETWORK [conn3925104] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] Socket recv() timeout [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] SocketException: remote: [IP]:[port] error: 9001 socket exception [RECV_TIMEOUT] server [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] DBClientCursor::init call() failed
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] scoped connection to [IP]:[port],[IP]:[port],[IP]:[port] not being returned to the pool
2016-01-15T16:15:23.256-0700 W SHARDING [LockPinger] distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579303:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed: [IP]:[port] (IP) failed:10276 DBClientBase::findN: transport error: [IP]:[port] ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2016-01-15T16:15:24.715-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925105 (64 connections now open)
2016-01-15T16:15:24.715-0700 I NETWORK [conn3925105] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:27.717-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925106 (64 connections now open)
2016-01-15T16:15:27.718-0700 I NETWORK [conn3925106] end connection [IP]:[port](63 connections now open)
Router 2 Log:
2016-01-15T16:18:21.762-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e3d110ccb8e38549a9d
2016-01-15T16:18:24.316-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:24 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:18:24.978-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:35.295-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e4a110ccb8e38549a9f
2016-01-15T16:18:38.507-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:48.838-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e58110ccb8e38549aa1
2016-01-15T16:18:52.038-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:54.660-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:54 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:19:02.323-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e66110ccb8e38549aa3
2016-01-15T16:19:05.513-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
Problematic Shard Log:
2016-01-15T16:21:03.426-0700 W SHARDING [conn40] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 46 took 17364ms
2016-01-15T16:21:03.484-0700 I COMMAND [conn40] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 17421ms
2016-01-15T16:21:03.775-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:21:03 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1441718306:765353801', sleeping for 30000ms
2016-01-15T16:21:04.321-0700 I SHARDING [conn40] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:08.243-0700 I SHARDING [conn46] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:10.174-0700 W SHARDING [conn37] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 60 took 18516ms
2016-01-15T16:21:10.232-0700 I COMMAND [conn37] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18574ms
2016-01-15T16:21:10.989-0700 W SHARDING [conn25] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 62 took 18187ms
2016-01-15T16:21:11.047-0700 I COMMAND [conn25] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18246ms
2016-01-15T16:21:11.365-0700 I SHARDING [conn37] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }