On November 17, from 9:13 a.m. to 10:34 a.m., Cytracom users experienced delayed and intermittent call completion caused by capacity limits being reached within our internal DNS resolution cluster (service discovery). The root cause was a failure to accurately monitor and manage this service's capacity. We have since added additional layers of monitoring, and this weekend we are implementing a permanent fix that also enables better alerting and capacity planning going forward.
In addition to these immediate fixes, we will conduct a review and analysis of our failover capabilities.
Why was there an increase and overload of the service?
We have experienced massive growth in the use of our services and continually monitor, plan for, and increase system capacity. However, capacity limits set early in the platform's life on its internal DNS services were not aggressively monitored. In addition, during the timeframe of this issue we observed an attempted toll-fraud attack, which contributed heavily to the system and request load.
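The kind of utilization-based alerting we are adding can be sketched as follows. This is a minimal illustration; the function, metric names, and thresholds are hypothetical, not our production configuration:

```python
# Minimal sketch of utilization-based capacity alerting for a DNS
# resolution cluster. All names and thresholds are illustrative.

def check_capacity(current_qps: float, max_qps: float,
                   warn_at: float = 0.70, page_at: float = 0.90) -> str:
    """Return an alert level for the cluster's query load.

    Alerting well below 100% utilization leaves headroom to add
    capacity before requests start to queue or fail.
    """
    utilization = current_qps / max_qps
    if utilization >= page_at:
        return "page"   # immediate operator response required
    if utilization >= warn_at:
        return "warn"   # plan a capacity increase
    return "ok"

# Example: a traffic spike (such as an attempted toll-fraud attack)
# pushes load toward the configured limit.
print(check_capacity(current_qps=9500, max_qps=10000))  # -> page
```

The key point is alerting on the ratio of load to the configured limit, so a limit set early in a platform's life still produces warnings as real traffic grows toward it.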
Why didn't Voice Continuity Policies trigger?
Voice Continuity (VC) (https://help.cytracom.com/hc/en-us/articles/360000243323) is a feature that allows a customer experiencing a local network outage to have inbound calls directed to an alternate, pre-defined endpoint on a separate, functioning network. In short, VC failover solves for localized customer network and ISP issues; because this incident originated within our platform rather than on customer networks, VC policies were not applicable.
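The VC routing model described above can be sketched like this. Names and types are illustrative, not Cytracom's implementation:

```python
# Minimal sketch of Voice Continuity-style routing: if the customer's
# primary endpoint is unreachable (a local network/ISP outage), route
# inbound calls to the pre-defined alternate endpoint.

from dataclasses import dataclass

@dataclass
class Endpoint:
    address: str
    reachable: bool

def route_inbound_call(primary: Endpoint, alternate: Endpoint) -> str:
    """Pick the destination for an inbound call."""
    if primary.reachable:
        return primary.address
    # Local outage at the customer site: fail over to the alternate.
    return alternate.address

office = Endpoint("sip:desk@customer.example", reachable=False)
cell = Endpoint("sip:cell@carrier.example", reachable=True)
print(route_inbound_call(office, cell))  # -> sip:cell@carrier.example
```

Note that the check covers only the customer endpoint's reachability: during a platform-side slowdown the primary endpoint still appears reachable, so this failover does not fire.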
Why didn't failover kick in for the Active/Active datacenter?
Our Active/Active failover is currently designed to trigger based on system and region failures rather than delays. We will be analyzing this service impairment and how our Active/Active architecture can be leveraged to mitigate user impact due to service degradation or delays.
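One way degradation could be factored into a failover decision is to treat sustained latency above a target, alongside hard failures, as unhealthy. This is a hypothetical sketch of that idea, not a description of our current or planned design; all thresholds are illustrative:

```python
# Hypothetical sketch: a health check that treats sustained latency,
# not only hard failures, as grounds for failing over between
# active/active datacenters.

from statistics import median

def is_healthy(recent_latencies_ms: list[float],
               error_count: int,
               latency_slo_ms: float = 250.0,
               max_errors: int = 3) -> bool:
    """Healthy only if errors are rare AND typical latency meets the SLO."""
    if error_count > max_errors:
        return False        # hard failures: the classic failover trigger
    if not recent_latencies_ms:
        return True
    # Degradation trigger: median latency persistently above the SLO.
    return median(recent_latencies_ms) <= latency_slo_ms

# A datacenter that is "up" but slow would now be failed away from:
print(is_healthy([900, 850, 1200, 950], error_count=0))  # -> False
```

Using a median over a recent window (rather than a single sample) avoids flapping between datacenters on one slow request.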
We appreciate the ongoing feedback and suggestions from our partners and customers in response to this incident. Our mission as an engineering organization is to empower our partners and customers with the most reliable cloud-communication solutions. We failed to deliver in this case and for that we are deeply sorry. We will continue to apply learnings from this incident as we work to further improve resiliency within our platform.