Questions tagged [sre]

Site Reliability Engineering (SRE) is a reliability-focused implementation of DevOps.

Its highest-level concern is to design, build, and support software with an "ever-watchful eye on system availability, latency, performance, and capacity".

SRE started at Google but has since been adopted by several other companies.

49 questions
0
votes
1 answer

PromQL queries for SLIs (Service Level Indicators) using Prometheus/Grafana and blackbox exporter

I want to achieve the specified SLIs (Service Level Indicators) for our HTTP endpoints, using blackbox exporter for probing, such as the following indicators: 80% availability; latency less than 1s. For latency I figured I can use the query…
sal
0
votes
1 answer

Harbor registry proxy cache vs replication

I'm new to the Harbor registry. I was asked to propose an architecture for Harbor in my company. I first proposed an architecture based on proxy cache, but the CISO refused to use proxy cache for the enterprise without saying why. I proposed…
0
votes
0 answers

Does anyone have a dataset that can be used for root cause analysis?

I need a lot of data to build a knowledge graph. Our team is trying to build a knowledge map, but there is not enough knowledge data.
zc lu
0
votes
1 answer

Application monitoring using SQL and shell scripts

We are using shell scripts and SQL queries to monitor our application. We are planning to migrate to the cloud and use Prometheus and OpenSearch for monitoring. Is there a way to execute Oracle SQL queries (get the number of active users, etc.) and store…
0
votes
0 answers

How to create a Grafana alert for when a backup has failed?

We have 3 PostgreSQL databases in GCP's CloudSQL, and all three of them are backed up daily. I need to use Grafana to monitor those backups and alert when they've failed. Unfortunately I'm not finding many resources to help me with this task. Is there…
0
votes
0 answers

Can SonarQube check whether code is observable?

As there is demand for reliable software systems, i.e. the site should be reliable (SRE), there is a need to check whether the code is observable. Does SonarQube have any rules to check for this?
UmeshPathak
0
votes
0 answers

RPN (Risk Priority Number) of FMEA and SLO

One of the concepts in FMEA (Failure Mode & Effect Analysis) is the RPN (Risk Priority Number), which decides how to prioritize your actions for addressing those failures. However, going by just severity, probability and effectiveness of control…
kembhootha
0
votes
1 answer

Should a not-found or empty response always be 404?

I have an endpoint for a REST API that checks for the existence of a (or a list of) requests. It can return 200 OK if there is an order in progress or 404 NOT FOUND if there are no current orders. Creating an availability SLO for this API, I noticed…
0
votes
0 answers

Consul Serf Health Status

I have installed a Consul server (leader) on my localhost, with an IP address of 192.168.48.1 => running OK. Then I installed a Vagrant box (Ubuntu 20.04) as a Consul agent, with an IP address of 10.0.2.15, and I informed about the bridge within the…
YoussHark
0
votes
1 answer

Prometheus: How do I write a PromQL query to find a drastic increase or decrease by some X% in my graph that persists for 10m, so that I can raise an alert?

I am trying to use a rate() query that compares the last 10 min with the previous 50 min, like: (sum by() rate(cmd_get{}[10m]) / (sum by() rate(cmd_get{}[50m] offset 10m)) If I want to check whether the percentage increase is more than 50%, then what is the…
0
votes
1 answer

Alertmanager: how to send alerts only on weekdays?

I tried to add this to my alertmanager.yml at the root level, but I got this error: yaml: unmarshall errors: field time_intervals not found in type config.plain time_intervals: - times: weekdays: ['monday:friday'] (I used 0.23 version of…
TestAutomator
0
votes
1 answer

How to set SLO for operations that are dependent on file size?

I have an endpoint POST /upload that uploads a file into my storage. The response time depends on the file size (the bigger the file, the longer it takes to respond with 200). How should I set a Service Level Objective (SLO) for this endpoint? Any…
NyamNyam
0
votes
1 answer

How can I OOM-kill a pod manually in Kubernetes?

I'm trying to manually OOM-kill pods for testing purposes. Does anyone know how I can achieve this?
0
votes
1 answer

Puppet 3 | read values from a different YAML file

I'm using Puppet 3 and I have X.yaml and Y.yaml. X.yaml has profiles::resolv_conf::nameservers: [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] in it. I want to add that [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] as a value to the servers: key, which is in Y.yaml: …
0
votes
1 answer

Flink 1.14.3 - [issue] failed to bind to /0.0.0.0:6123

We are using version 1.14.3 of Flink, and when we try to run the JobManager we get the exception below. I tried entering akka.remote.netty.tcp.hostname = "127.0.0.1" in the flink-conf.yml file and even replaced the IP with the hostname. But didn't…