Our system uses a custom algorithm to confirm every outage before sending any notifications. When a service check first fails, confirmation checks are run from several geographically nearby monitoring locations, and the notification process begins only after a quorum of these locations confirms the outage.
Once the service has been restored and is returning successful checks, the same process is repeated to confirm the resolution. With this approach, spurious failures are filtered out and alerts are sent only for real, confirmed problems.
Because of how our system is designed, the entire confirmation process is fast, typically completing within 20 seconds of the initial failure detection.
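
As a rough illustration of the quorum idea described above, the following Python sketch shows one way such a confirmation step could look. It is not our production algorithm. The names `Probe`, `quorum_confirms`, and `next_state`, as well as the quorum size, are assumptions made for this example. The probes here run one after another for simplicity, whereas a real system would presumably run them concurrently.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type: a probe runs one service check from a single monitoring
# location and returns True when that check succeeds.
Probe = Callable[[], bool]


def quorum_confirms(probes: Iterable[Probe], expect_up: bool, quorum: int) -> bool:
    """Return True as soon as at least `quorum` probes report the expected state."""
    agreeing = 0
    for probe in probes:
        if probe() == expect_up:
            agreeing += 1
            if agreeing >= quorum:
                return True
    return False


def next_state(
    currently_down: bool,
    check_ok: bool,
    probes: List[Probe],
    quorum: int,
) -> Tuple[bool, Optional[str]]:
    """Decide the new state and which notification (if any) to send.

    Returns (is_down, event), where event is "outage", "recovery", or None.
    """
    if not currently_down and not check_ok:
        # First failure detected: notify only if a quorum also sees the failure.
        if quorum_confirms(probes, expect_up=False, quorum=quorum):
            return True, "outage"
    elif currently_down and check_ok:
        # First success after an outage: confirm the recovery the same way.
        if quorum_confirms(probes, expect_up=True, quorum=quorum):
            return False, "recovery"
    # Unconfirmed or unchanged: keep the current state and stay silent.
    return currently_down, None
```

In this sketch, a single failed check that the nearby locations do not reproduce leaves the state unchanged and triggers no alert, which is the filtering of spurious failures described above.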