We have an ES cluster with three data nodes storing logging data as part of an ELK stack. It's running smoothly for the most part; however, some of our indices have large transaction logs, e.g.:
{
  "logstash-2018.05.01": {
    "operations": 30706751,
    "size_in_bytes": 214119243282,
    "uncommitted_operations": 0,
    "uncommitted_size_in_bytes": 258
  },
  "logstash-2018.04.30": {
    "operations": 21773218,
    "size_in_bytes": 150386904084,
    "uncommitted_operations": 0,
    "uncommitted_size_in_bytes": 301
  },
  "logstash-2018.05.03": {
    "operations": 20829483,
    "size_in_bytes": 137593564722,
    "uncommitted_operations": 0,
    "uncommitted_size_in_bytes": 258
  },
  "logstash-2018.05.04": {
    "operations": 547253,
    "size_in_bytes": 3573423078,
    "uncommitted_operations": 253171,
    "uncommitted_size_in_bytes": 1627542718
  },
  "logstash-2018.05.02": {
    "operations": 5341,
    "size_in_bytes": 29375126,
    "uncommitted_operations": 5341,
    "uncommitted_size_in_bytes": 29375126
  }
}
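(For reference, per-index translog stats like these can be pulled from the index stats API; the host below is a placeholder for your own cluster:)

curl -s 'http://localhost:9200/logstash-*/_stats/translog?filter_path=indices.*.primaries.translog&pretty'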
From what I've read, the translog should be trimmed once the operations in it have been committed to the relevant shard. Yet we have a large number of operations, and therefore data, sitting in the transaction log even though the number of uncommitted operations is low.
This came to light when the cluster was rebalancing shards. Some shards would relocate in minutes, while others took hours. For example, the cluster moved a 42 GB shard with no translog in 17 minutes, whereas a 19 GB shard with a large translog took over 12 hours. Neither of these indices is the current daily log index.
We've not changed the translog settings from the defaults.
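(The translog-related index settings we're aware of are index.translog.flush_threshold_size, index.translog.retention.size, index.translog.retention.age and index.translog.durability. For completeness, this is how we'd confirm the effective values on one of the affected indices; again the host is a placeholder:)

curl -s 'http://localhost:9200/logstash-2018.05.01/_settings?include_defaults=true&filter_path=*.*.index.translog&pretty'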
How can we prevent these large translogs from building up and, if they do, how can we clear them down?
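If it's relevant: we could presumably force a flush on an affected index, e.g. the command below (placeholder host), but it isn't obvious whether that would actually trim operations that already show as committed in the stats above.

curl -s -X POST 'http://localhost:9200/logstash-2018.05.01/_flush?pretty'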