On November 17, from 9:13 a.m. to 10:34 a.m., Cytracom users experienced delayed and intermittent call completion caused by capacity limits being reached within our internal DNS resolution cluster (service discovery). The root cause was a failure to accurately monitor and manage this service's capacity. We have since added additional layers of monitoring, and this weekend we are implementing a permanent fix that also enables better alerting and capacity planning going forward.
In addition to these immediate fixes, we will conduct a review and analysis of our failover capabilities.
Why was there an increase and overload of the service?
We have experienced massive growth in the use of our services and continually monitor, plan for, and increase system capacity. However, capacity limits set early in the platform's life on its internal DNS services were not aggressively monitored. In addition, during the timeframe of this issue we observed an attempted toll-fraud attack, which contributed heavily to the system and request load.
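The kind of utilization-based alerting we are adding can be sketched as follows. This is a minimal illustration; the function, metric names, and thresholds are hypothetical, not our production configuration:

```python
# Minimal sketch of utilization-based capacity alerting for a DNS
# resolution cluster. All names and thresholds are illustrative.

def check_capacity(current_qps: float, max_qps: float,
                   warn_at: float = 0.70, page_at: float = 0.90) -> str:
    """Return an alert level for the cluster's query load.

    Alerting well below 100% utilization leaves headroom to add
    capacity before requests start to queue or fail.
    """
    utilization = current_qps / max_qps
    if utilization >= page_at:
        return "page"   # immediate operator response required
    if utilization >= warn_at:
        return "warn"   # plan a capacity increase
    return "ok"

# Example: a traffic spike (such as an attempted toll-fraud attack)
# pushes load toward the configured limit.
print(check_capacity(current_qps=9500, max_qps=10000))  # -> page
```

The key point is alerting on the ratio of load to the configured limit, so a limit set early in a platform's life still produces warnings as real traffic grows toward it.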
Why didn't Voice Continuity Policies trigger?
Voice Continuity (VC) (https://help.cytracom.com/hc/en-us/articles/360000243323) is a feature that allows a customer experiencing a local network outage to have inbound calls directed to an alternate, pre-defined endpoint on a separate, functioning network. In short, VC failover solves for localized customer network and ISP issues; because this incident originated within our platform rather than on customer networks, VC policies were not applicable.
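The VC routing model described above can be sketched like this. Names and types are illustrative, not Cytracom's implementation:

```python
# Minimal sketch of Voice Continuity-style routing: if the customer's
# primary endpoint is unreachable (a local network/ISP outage), route
# inbound calls to the pre-defined alternate endpoint.

from dataclasses import dataclass

@dataclass
class Endpoint:
    address: str
    reachable: bool

def route_inbound_call(primary: Endpoint, alternate: Endpoint) -> str:
    """Pick the destination for an inbound call."""
    if primary.reachable:
        return primary.address
    # Local outage at the customer site: fail over to the alternate.
    return alternate.address

office = Endpoint("sip:desk@customer.example", reachable=False)
cell = Endpoint("sip:cell@carrier.example", reachable=True)
print(route_inbound_call(office, cell))  # -> sip:cell@carrier.example
```

Note that the check covers only the customer endpoint's reachability: during a platform-side slowdown the primary endpoint still appears reachable, so this failover does not fire.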
Why didn't failover kick in for the Active/Active datacenter?
Our Active/Active failover is currently designed to trigger based on system and region failures rather than delays. We will be analyzing this service impairment and how our Active/Active architecture can be leveraged to mitigate user impact due to service degradation or delays.
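One way degradation could be factored into a failover decision is to treat sustained latency above a target, alongside hard failures, as unhealthy. This is a hypothetical sketch of that idea, not a description of our current or planned design; all thresholds are illustrative:

```python
# Hypothetical sketch: a health check that treats sustained latency,
# not only hard failures, as grounds for failing over between
# active/active datacenters.

from statistics import median

def is_healthy(recent_latencies_ms: list[float],
               error_count: int,
               latency_slo_ms: float = 250.0,
               max_errors: int = 3) -> bool:
    """Healthy only if errors are rare AND typical latency meets the SLO."""
    if error_count > max_errors:
        return False        # hard failures: the classic failover trigger
    if not recent_latencies_ms:
        return True
    # Degradation trigger: median latency persistently above the SLO.
    return median(recent_latencies_ms) <= latency_slo_ms

# A datacenter that is "up" but slow would now be failed away from:
print(is_healthy([900, 850, 1200, 950], error_count=0))  # -> False
```

Using a median over a recent window (rather than a single sample) avoids flapping between datacenters on one slow request.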
We appreciate the ongoing feedback and suggestions from our partners and customers in response to this incident. Our mission as an engineering organization is to empower our partners and customers with the most reliable cloud-communication solutions. We failed to deliver in this case and for that we are deeply sorry. We will continue to apply learnings from this incident as we work to further improve resiliency within our platform.