Topic 1 Question 59
You encounter a large number of outages in the production systems you support. You receive alerts for all the outages that wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?
A. Eliminate unactionable alerts.
B. Create an incident report for each of the alerts.
C. Distribute the alerts to engineers in different time zones.
D. Redefine the related Service Level Objective so that the error budget is not exhausted.
Comments (15)
I reckon it's A. The problem seems to fix itself with a restart of the service after a minute, so engineers don't really need to be woken up for it. If it failed multiple times, or if the restart failed, then the engineer should be woken up.
👍 13 AL12 2021/11/01
- 👍 4 TNT87 2021/10/28
- Selected answer: A
Agree with kyubiblaze about removing unactionable alerts, a.k.a. spam: "good monitoring alerts on actionable problems" @ https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles
👍 4 zygomar 2022/02/21
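To make the rule from the first comment concrete (page only when the automatic restart fails or the failures keep recurring), here is a rough sketch in Python. It is not taken from any Google tooling: `HealthEvent`, `should_page`, and the threshold of 3 failures are all hypothetical names and values picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthEvent:
    """Hypothetical record of one unhealthy-system event."""
    service: str
    restart_succeeded: bool   # did the automatic restart bring it back?
    failures_in_window: int   # failures seen recently, including this one


# Assumed threshold; a real value would come from the service's SLO.
MAX_FAILURES_BEFORE_PAGE = 3


def should_page(event: HealthEvent,
                max_failures: int = MAX_FAILURES_BEFORE_PAGE) -> bool:
    """Return True only when a human can actually do something useful."""
    if not event.restart_succeeded:
        # Automation did not recover the system, so the alert is actionable.
        return True
    # Recovered on its own: page only if it keeps happening.
    return event.failures_in_window >= max_failures


if __name__ == "__main__":
    one_off = HealthEvent("checkout", restart_succeeded=True, failures_in_window=1)
    flapping = HealthEvent("checkout", restart_succeeded=True, failures_in_window=4)
    stuck = HealthEvent("checkout", restart_succeeded=False, failures_in_window=1)
    for ev in (one_off, flapping, stuck):
        print(ev, "-> page" if should_page(ev) else "-> log only")
```

A single self-healed restart is only logged here, which matches answer A: the unactionable alert is removed while repeated or unrecovered failures still reach an engineer.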