Questions tagged [sre]

Site Reliability Engineering (SRE) is a reliability-focused implementation of DevOps.

Its highest-level concern is to design, build, and support software with an "ever-watchful eye on system availability, latency, performance, and capacity".

SRE started at Google and has since been adopted by several other companies.

49 questions
0
votes
0 answers

Grafana data source not found

Everyone, I started using Grafana as a one-stop dashboard for monitoring our infra. I have multiple Prometheus sources and was able to add them to Grafana successfully; however, queries are not able to access the data source through proxy…
Swaroop Kundeti
  • 515
  • 4
  • 11
  • 25
0
votes
0 answers

What are the 4 golden points to monitor Jenkins E2E server?

I have a task to monitor the 4 golden signals of a Jenkins E2E server. I have already configured latency, network throughput, and errors. Please let me know which metric I should include alongside these 3.
NITS
  • 11
  • 3
0
votes
2 answers

What happens if a container exceeds its CPU request but stays under its limit on Kubernetes?

In Kubernetes we can set limits and requests for CPU. If the container exceeds the limit, from my understanding it will be throttled. However, if the container exceeds the request but is still under the limit, what would happen? Would I see…
user2962698
  • 115
  • 2
  • 9
0
votes
1 answer

Give access to a service principal that is in another Azure tenant

We deploy resources in our Azure tenant through Jenkins, which uses Terraform to provision infra resources. We use a service principal for authentication and infra provisioning, and both are in the same tenant. In our infra deployment we also create VNET…
0
votes
1 answer

How to add multiple AWS ClientVPN Routes using Terraform

I have an AWS ClientVPN that was created manually from the AWS console, and it has more than 20 route table entries. Now I want to manage it with Terraform so we can add any new route using Terraform. I have imported the ClientVPN information using terraform…
0
votes
1 answer

Prometheus inhibit alert selectively

I need to create an alerting system that has to notify when a particular condition (e.g. Tomcat goes down) is met. Multiple remote servers deployed in different locations (with different time zones) host Tomcat services and are being monitored by…
0
votes
1 answer

Anchore Container scanning in Jenkins CI Pipeline

I need help with my Jenkinsfile CI file. Code in Jenkinsfile looks like this: pipeline { environment { registry = "user/demo1" registryCredential = 'dockerhub' dockerImage = '' } agent any stages { stage('Building image') { …
Rukender
  • 49
  • 2
  • 7
0
votes
2 answers

Prometheus rules - check file count inside a directory of an app container

I'm looking to write a Prometheus rule that constantly checks the message queue length (Exim mail relay), which is the total number of files in a directory in an app's container, and alerts a Slack channel via Alertmanager. Is this possible at all with…
Avi
  • 1,453
  • 4
  • 18
  • 43
0
votes
1 answer

Prometheus alert expression for 99% availability of a REST API

I would like to create an alert in Prometheus for a REST API that fires if the API is not available 99% of the time. I am new to Prometheus expressions. Could you please help me create an expression to trigger this alert? For example, if I have a counter…
user3777385
  • 31
  • 1
  • 8
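The availability check described in the question above can be sketched outside Prometheus as plain arithmetic. This is only an illustration, not a PromQL expression: the function names and the sample counts are hypothetical, and it assumes availability is defined as successful requests divided by total requests over some window.

```python
def availability(success_total: float, request_total: float) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if request_total <= 0:
        return 1.0
    return success_total / request_total


def should_alert(success_total: float, request_total: float, target: float = 0.99) -> bool:
    """True when measured availability falls below the target (99% by default)."""
    return availability(success_total, request_total) < target


# Hypothetical counts for one evaluation window.
print(should_alert(985, 1000))  # 98.5% availability, below the 99% target
print(should_alert(995, 1000))  # 99.5% availability, at or above the target
```

Treating a window with no traffic as fully available is a design choice made here for simplicity; an alternative is to skip evaluation for such windows.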
0
votes
0 answers

Chef recipe to check the count of processes and monitor the number of open file descriptors

I am trying to update the metricbeat_cookbook to get some required info. Monitor the number of sessions and html5 client processes running on each server. ps -ef | grep -i html5 | wc -l. This is the logic I need in the…
sunny
  • 1
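The counting logic in the shell pipeline quoted above can be sketched as a small function. This is not a Chef recipe or Metricbeat configuration; the function name and the sample process listing are hypothetical. One detail worth noting is that `ps -ef | grep -i html5 | wc -l` usually counts the `grep` process itself, so the sketch filters it out.

```python
def count_matching(ps_output: str, needle: str, exclude: str = "grep") -> int:
    """Count lines containing `needle` (case-insensitive), skipping lines that contain `exclude`."""
    needle = needle.lower()
    count = 0
    for line in ps_output.splitlines():
        lowered = line.lower()
        # Skipping lines with "grep" is a blunt filter: it would also drop a
        # real process whose command line happens to contain that word.
        if needle in lowered and exclude not in lowered:
            count += 1
    return count


# Hypothetical `ps -ef` output, for illustration only.
SAMPLE = """\
app  101  1  0 10:00 ?     00:00:01 /opt/app/HTML5-client --session 1
app  102  1  0 10:00 ?     00:00:01 /opt/app/html5-client --session 2
root 200  1  0 10:01 pts/0 00:00:00 grep -i html5
"""

print(count_matching(SAMPLE, "html5"))
```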
0
votes
2 answers

Eliminate specific value from Jmx exporter through config Yaml

Here is the current Jmx exporter pattern: pattern: 'metrics<>Value' name: 'x.y.z.resilience4j.circuitbreaker.state' labels: {name: "$1", kind: "$2" } type: GAUGE Current…
Md. Hasan Basri
  • 159
  • 1
  • 15
0
votes
1 answer

Is the maintenance window burning error budget

Is the maintenance window burning error budget? Example: Let's say I have a 1h error budget left. I stop the service for planned maintenance for 30 minutes. Is the error budget still 1h or is it 30 minutes? The maintenance window is happening when…
danielinclouds
  • 347
  • 1
  • 2
  • 9
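The arithmetic in the example above can be sketched as follows. Whether planned maintenance consumes error budget depends on how the SLO is defined, so the sketch takes that as an explicit parameter and does not pick one policy; the function and parameter names are hypothetical.

```python
from datetime import timedelta


def remaining_budget(
    budget_left: timedelta,
    downtime: timedelta,
    planned: bool,
    count_planned_downtime: bool,
) -> timedelta:
    """Return the error budget left after an outage of length `downtime`."""
    # If the SLO excludes planned maintenance, a planned outage costs nothing.
    if planned and not count_planned_downtime:
        return budget_left
    # Otherwise the downtime is subtracted, never going below zero.
    return max(budget_left - downtime, timedelta(0))


budget = timedelta(hours=1)
maintenance = timedelta(minutes=30)

# Policy A: the SLO counts all downtime, planned or not.
print(remaining_budget(budget, maintenance, planned=True, count_planned_downtime=True))
# Policy B: the SLO excludes announced maintenance windows.
print(remaining_budget(budget, maintenance, planned=True, count_planned_downtime=False))
```

Under the first policy the 30-minute window leaves 30 minutes of budget; under the second the full hour remains.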
0
votes
1 answer

What are best practices for deploying new features for a Spring Boot application?

I have a Spring Boot application with a large number of users and many incoming requests. What should I do to deploy a new feature to the application without losing incoming user requests and actually interrupting the…
Moya
  • 21
  • 2
0
votes
1 answer

New to DevOps and CI/CD

Like I said in the title, I'm new to DevOps and CI/CD. I don't have much experience (except for online tutorials) and I'm looking to start a project (nothing huge) that will use automated CI/CD pipelines for all microservices. The question is, what…
user12855411
-1
votes
1 answer

How do we set the name of the Docker network in docker-compose?

I wrote a docker-compose file that creates multiple containers. When I use the docker-compose up command it works fine, but when I bring it down again it gives **error while removing network** Stopping sonarqube ... error Stopping note_sonatype_1 ... error Stopping…
Mayur Dagdi
  • 11
  • 1
  • 4