
Environment: GKE, 6 nodes, each node with 16 GB RAM (shared with other pods) and 4 cores (also shared). MongoDB deployment: Bitnami Helm chart version 13.5.x, replicaset architecture (3 data-bearing members and 1 arbiter).

I was trying to remove a lot of dirty data (about 100,000 docs, each estimated at about 2 KB) on my MongoDB primary via port-forwarding directly to the pod, because in my past experience, every time I port-forwarded through the Kubernetes Service (even with directConnection / primaryPreferred options), I ended up connected to a secondary.
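
For reference, a minimal sketch of how such a cleanup could be run against the primary via port-forward in smaller batches. The pod name, database, collection and filter below are placeholders, and the batching with a majority write concern is an assumption about a gentler approach, not what I actually ran:

# Port-forward straight to the primary pod instead of the Service, so the
# connection cannot land on a secondary (mongodb-0 being primary is an assumption).
kubectl port-forward pod/mongodb-0 27017:27017 &

# Delete in small batches so the oplog and the secondaries can keep up;
# "mydb", "mycoll" and the {dirty: true} filter are placeholders.
mongosh "mongodb://localhost:27017/mydb?directConnection=true" --eval '
  let deleted;
  do {
    const ids = db.mycoll.find({ dirty: true }, { _id: 1 }).limit(1000).toArray().map(d => d._id);
    deleted = db.mycoll.deleteMany({ _id: { $in: ids } }, { writeConcern: { w: "majority" } }).deletedCount;
    print("deleted " + deleted);
    sleep(500);
  } while (deleted > 0);
'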

When running the delete query, I was foolish enough to assume I had sufficient resources for the operation (since the nodes are shared). Now my replica set is in a crash loop, apparently triggered by a slow query. My understanding (correct me if I'm wrong) is that when the secondaries try to sync the oplog, they don't have enough resources to keep up.
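
To check that theory once any member briefly accepts connections, the oplog window and secondary lag can be inspected with the standard mongosh helpers; a sketch, assuming the same local port-forward as above:

mongosh "mongodb://localhost:27017/?directConnection=true" --eval '
  rs.printReplicationInfo();          // oplog size and the time window it covers on this member
  rs.printSecondaryReplicationInfo(); // how far each secondary is behind the primary
'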

Current pod status

NAME                                      READY   STATUS                   RESTARTS          AGE
mongodb-0                                 0/1     CrashLoopBackOff         377 (80s ago)     36h
mongodb-1                                 0/1     Running                  1 (134m ago)      27h
mongodb-2                                 0/1     ContainerStatusUnknown   59 (6h ago)       11h
mongodb-arbiter-0                         1/1     Running                  309 (5m40s ago)   2d1h

Log on mongodb-0

{"t":{"$date":"2023-07-21T05:38:30.240+00:00"},"s":"I",  "c":"REPL",     "id":21550,   "ctx":"initandlisten","msg":"Replaying stored operations from startPoint (exclusive) to endPoint (inclusive)","attr":{"startPoint":{"$timestamp":{"t":1689641543,"i":5857}},"endPoint":{"$timestamp":{"t":1689641885,"i":1}}}}
{"t":{"$date":"2023-07-21T05:38:30.377+00:00"},"s":"I",  "c":"-",        "id":4939300, "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible.","nextWakeupMillis":1200}}
{"t":{"$date":"2023-07-21T05:38:30.415+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"initandlisten","msg":"Slow query","attr":{"type":"command","ns":"local.oplog.rs","command":{"getMore":151664126847388003,"collection":"oplog.rs","$db":"local"},"originatingCommand":{"find":"oplog.rs","filter":{"ts":{"$gte":{"$timestamp":{"t":1689641543,"i":5857}},"$lte":{"$timestamp":{"t":1689641885,"i":1}}}},"readConcern":{},"$db":"local"},"planSummary":"COLLSCAN","cursorid":151664126847388003,"keysExamined":0,"docsExamined":100510,"numYields":100,"nreturned":100509,"queryHash":"23904D31","planCacheKey":"23904D31","reslen":16777108,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":29}},"FeatureCompatibilityVersion":{"acquireCount":{"r":123,"w":18}},"ReplicationStateTransition":{"acquireCount":{"w":36}},"Global":{"acquireCount":{"r":123,"w":13,"W":5}},"Database":{"acquireCount":{"r":15,"w":12,"W":1}},"Collection":{"acquireCount":{"r":19,"w":4,"W":4}},"Mutex":{"acquireCount":{"r":34}},"oplog":{"acquireCount":{"w":1}}},"flowControl":{"acquireCount":10,"timeAcquiringMicros":19},"readConcern":{"provenance":"implicitDefault"},"storage":{"data":{"bytesRead":368803082,"timeReadingMicros":5070664},"timeWaitingMicros":{"schemaLock":628}},"protocol":"op_msg","durationMillis":174}}
{"t":{"$date":"2023-07-21T05:38:31.581+00:00"},"s":"I",  "c":"-",        "id":4939300, "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible.","nextWakeupMillis":1400}}

I have tried to access the cluster via port-forward to resize the oplog, hoping that would give the members enough room to sync, but with no success, since the pods are stuck in an endless crash loop. I'm also not sure this is the correct solution (based on https://www.mongodb.com/community/forums/t/alert-replication-oplog-window-has-gone-below-1-hours/114043/2), since a bigger oplog means the cluster needs even more resources.
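
For reference, the resize itself would be a single admin command once a member is reachable (the 4096 MB value below is only an illustration, not a recommendation):

mongosh "mongodb://localhost:27017/admin?directConnection=true" --eval '
  // replSetResizeOplog takes the new size in megabytes (minimum 990 MB).
  db.adminCommand({ replSetResizeOplog: 1, size: 4096 })
'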

Any suggestion on how to handle this problem would be greatly appreciated. Thanks!

Zaki
  • First - please post (redacted-as-needed) _text_ rather than images of errors and log messages. Second - What do you mean by crash loop? What errors etc are associated with that? The single entry for a slow query (still under 0.15 seconds) is _not_ a fatal (or perhaps even noticeable) problem on its own by any means – user20042973 Jul 20 '23 at 18:19
  • @user20042973 thanks for the direction, i've edited the question. – Zaki Jul 21 '23 at 05:47

0 Answers