This document in the Google Cloud Architecture Framework provides operational principles to create alerts that help you run reliable services. The more information you have about how your service performs, the more informed your decisions are when an issue occurs. Design your alerts for early and accurate detection of all user-impacting system problems, and minimize false positives.
Optimize the alert delay
There's a balance to strike between alerts that are sent too soon, which stress the operations team, and alerts that are sent too late, which cause long service outages. Tune the delay before the monitoring system notifies humans of a problem so that you minimize time to detect while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alert configuration.
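As one illustration of how the error budget consumption rate can drive alert timing, the following minimal sketch evaluates a burn rate over a lookback window and pages only when the budget is being spent fast enough to threaten the SLO. The SLO target, window semantics, and 14.4x threshold are hypothetical values chosen for the example, not prescribed settings.

```python
# Minimal sketch: page when the error budget burn rate over a window
# exceeds a threshold. All values below are illustrative, not recommendations.

SLO_TARGET = 0.999                  # hypothetical SLO: 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO_TARGET       # 0.1% of requests may fail

def burn_rate(failed_requests: int, total_requests: int) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if total_requests == 0:
        return 0.0
    observed_error_ratio = failed_requests / total_requests
    return observed_error_ratio / ERROR_BUDGET

def should_page(failed_requests: int, total_requests: int,
                threshold: float = 14.4) -> bool:
    """Page humans only if the burn rate, sustained, would exhaust a
    30-day budget in roughly two days (14.4x). Threshold is illustrative."""
    return burn_rate(failed_requests, total_requests) >= threshold

# Example: 30 failures out of 10,000 requests is a 3x burn rate -> no page.
print(should_page(failed_requests=30, total_requests=10_000))   # False
# 200 failures out of 10,000 requests is a 20x burn rate -> page.
print(should_page(failed_requests=200, total_requests=10_000))  # True
```

A slower burn rate can still feed a lower-priority notification such as a ticket, so the operations team learns about gradual budget consumption without being paged.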
Alert on symptoms rather than causes
Trigger alerts based on the direct impact on user experience, as shown in the sketch that follows. Noncompliance with global or per-customer SLOs indicates a direct impact. Don't alert on every possible root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.
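To make the distinction concrete, the sketch below alerts on the user-visible symptom, a service-wide success ratio dropping below the SLO, rather than on the cause of a single unhealthy replica. The replica data model and the 99.5% availability target are assumptions made for illustration only.

```python
# Minimal sketch: alert on the symptom (service-wide SLO noncompliance),
# not on the cause (one replica being unhealthy). The SLO value and the
# replica model are illustrative assumptions.

from dataclasses import dataclass

AVAILABILITY_SLO = 0.995  # hypothetical 99.5% availability target

@dataclass
class Replica:
    name: str
    good_requests: int
    total_requests: int

def symptom_alert(replicas: list[Replica]) -> bool:
    """Fire only when the aggregate, user-visible success ratio misses the SLO."""
    good = sum(r.good_requests for r in replicas)
    total = sum(r.total_requests for r in replicas)
    if total == 0:
        return False
    return good / total < AVAILABILITY_SLO

# One replica is failing hard, but the load balancer routes around it,
# so users still see SLO-compliant behavior and no page is sent.
replicas = [
    Replica("replica-a", good_requests=5000, total_requests=5000),
    Replica("replica-b", good_requests=4990, total_requests=5000),
    Replica("replica-c", good_requests=0, total_requests=10),  # unhealthy
]
print(symptom_alert(replicas))  # False: users are largely unaffected
```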
Alert on outlier values rather than averages
When monitoring latency, define SLOs and set alerts on high-percentile latency, for example on two of the 90th, 95th, and 99th percentiles, not on average or 50th percentile latency. Good mean or median latency values can hide unacceptably high values at the 90th percentile or above that cause a very poor user experience. Therefore, apply this principle of alerting on outlier values whenever you monitor latency for a critical operation, such as a request-response interaction with a web server, batch completion in a data processing pipeline, or a read or write operation on a storage service.
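As a small illustration of why tail percentiles matter, the sketch below computes the mean, median, and 99th percentile over a set of request latencies. The sample values and the 500 ms threshold are made up, but they show how a healthy-looking mean can coexist with a painful tail.

```python
# Minimal sketch: a good-looking mean latency can hide a bad tail.
# The latency samples and the 99th-percentile threshold are illustrative.

import statistics

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the given latency samples (in ms)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# 100 requests: most are fast, a few are very slow.
latencies_ms = [50.0] * 95 + [2000.0] * 5

print(statistics.mean(latencies_ms))   # 147.5 ms -- the mean looks acceptable
print(percentile(latencies_ms, 50))    # 50.0 ms  -- the median looks fine too
print(percentile(latencies_ms, 99))    # 2000.0 ms -- 1 in 100 users waits 2 s

# Alert on the tail, not on the mean: page if p99 exceeds a latency SLO.
P99_SLO_MS = 500.0  # hypothetical latency SLO
print(percentile(latencies_ms, 99) > P99_SLO_MS)  # True: this should alert
```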