July 15, 2016
Incident Report for June 24 and July 4, 2016
On June 24, 2016 and July 4, 2016, one of Datto’s primary colocation facilities in the United States experienced power outages that impacted our environment. Both of these unprecedented events were caused by a quadruple system failure in the colocation provider’s power infrastructure. The outages caused some of our customer web services to become temporarily unavailable and affected the availability of cloud storage associated with a large percentage of our device fleet; however, no customer data was compromised or lost.
We would like to stress how unacceptable and exceptional these events were. Datto’s colocation facility is designed to ensure that there is no downtime. The facility receives its main power supply from the public electrical grid and backs up this supply with three 2-MW generators and twelve separate uninterruptible power supplies, each with banks of batteries in parallel. Each row of cabinets storing Datto Cloud servers receives power from redundant power distribution units, which are associated with different uninterruptible power supplies connected to different segments of the facility’s electrical infrastructure.
The first event occurred on June 24, when a transformer operated by a public utility outside of Datto’s colocation center failed. This interrupted the main power supply and triggered the backup generators to come online. The generators responded promptly and were online within 10-15 seconds of the failure of the main power supply. After another 10-15 seconds, power from the backup generators was available to the facility and, within a minute of the initial power failure, eleven of the facility’s twelve electrical segments were receiving electricity from the backup generators.
However, one of the facility’s twelve electrical segments failed to transition to generator power because one of the main breakers failed. As a result, this segment of the facility, which hosted a large section of the Datto Cloud, continued to rely on the battery banks connected to the uninterruptible power supply serving it.
The uninterruptible power supply supporting large sections of the Datto Cloud was designed to sustain the associated infrastructure for 40 minutes. However, despite having been recently tested, the batteries supporting the uninterruptible power supply failed after 4 minutes. This caused the power distribution units associated with the battery banks to lose power as well.
Each of our cabinets in the colocation facility is supported by redundant segments of the facility’s electrical infrastructure, each with separate power distribution units, such that the cabinets may be fully supplied from either segment. When the battery bank failed, the servers that used that segment of the electrical infrastructure shifted to rely on the other power distribution unit available to them. However, after 2 minutes, the increased load overloaded the breaker associated with the redundant power distribution unit, which then also lost power.
This removed the redundant power supply available to a large section of the Datto Cloud, which at this point went down, having already lost the supply of electricity available through the other segment of the colocation facility’s infrastructure.
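The overload in this chain of failures reflects a general capacity rule for dual-fed cabinets: redundancy only holds if either feed can carry the combined load of both. The following is a minimal Python sketch of that check, not a description of the colocation provider's actual tooling. The `Feed` class, the `survives_failover` function, the ampere figures, and the 80% continuous-load derating factor are all illustrative assumptions; the report does not state the facility's real ratings or loads.

```python
from dataclasses import dataclass


@dataclass
class Feed:
    """One side of a redundant (A/B) power feed. All values are hypothetical."""
    name: str
    breaker_rating_amps: float  # nameplate rating of the upstream breaker
    load_amps: float            # load drawn in normal, shared operation


def survives_failover(surviving: Feed, failed: Feed, derate: float = 0.8) -> bool:
    """Return True if the surviving feed can absorb the failed feed's load.

    The derate factor (assumed here to be 0.8) leaves headroom below the
    breaker's nameplate rating for continuous load.
    """
    combined_load = surviving.load_amps + failed.load_amps
    return combined_load <= surviving.breaker_rating_amps * derate


# Hypothetical example: each feed looks comfortable on its own
# (about 45% of a 30 A breaker), but the combined load exceeds the
# derated limit once one feed is lost.
feed_a = Feed("A", breaker_rating_amps=30.0, load_amps=14.0)
feed_b = Feed("B", breaker_rating_amps=30.0, load_amps=13.0)

if not survives_failover(surviving=feed_b, failed=feed_a):
    print(f"Feed {feed_b.name} would overload if feed {feed_a.name} failed")
```

Under these assumed numbers, each feed must stay at or below half of the derated breaker limit in normal operation for a failover to be safe, which is the kind of headroom an additional redundant power distribution unit provides.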
Workers at the colocation center were able to manually switch the remaining segment of the electrical system to generator power 15 minutes later, restoring power to the affected sections of the Datto Cloud, and within another 30 minutes, ordinary power from the public grid had been restored to the whole Datto Cloud infrastructure. Despite the best efforts of all involved to address the issues that caused the June 24 outage, the same sequence of events was repeated on July 4, although the recovery time was shortened.
Immediate Response
Datto employees follow an emergency procedure in the event of any serious, time-sensitive incident affecting the Datto Cloud. The emergency procedure incorporates both rotating on-call engineers and an additional company-wide alert system designed to ensure that all employees are aware of the issues affecting our infrastructure and are available to remediate such issues as quickly as possible.
On June 24, Datto employees became aware of the incident within 1 minute of the failure of the power distribution units that caused the Datto Cloud to lose power. The employees monitoring the issue followed Datto’s emergency procedures and issued a company-wide alert 17 minutes after the initial outage. Within 2 hours of power becoming available, we had restored all customer web services and had begun bringing the first storage nodes associated with the Datto Cloud online. We restored substantially all of the remaining storage nodes within 8 hours of power being restored to the Datto Cloud.
After the June 24 incident, Datto immediately implemented additional automated scripts to improve the recovery time of the Datto Cloud in the event of an outage or other serious problem and updated our response procedures to incorporate the lessons learned from the outage. In addition, workers at the datacenter expedited the purchase of new batteries to support the failed uninterruptible power supply and sought to reduce the burden on the power distribution unit that had failed during the outage.
On July 4, Datto employees became aware of the subsequent incident within 1 minute of the failure of the power distribution units serving the Datto Cloud. We issued a company-wide alert within 7 minutes of the initial outage. Within 45 minutes of power becoming available, Datto employees had restored all web services to customers and had brought the first storage nodes online. We restored substantially all of the remaining storage nodes within 4 hours of power becoming available to the Datto Cloud.
In response to the July 4 incident, Datto has worked with its colocation center to implement additional safeguards to protect the Datto Cloud. On July 10, the workers at the colocation center brought an additional redundant power distribution unit online to prevent a future overload in the event that Datto’s servers need to rely on their redundant power distribution units. In addition, workers at the colocation center are currently replacing the batteries associated with the failed uninterruptible power supply serving the Datto Cloud. This process should be complete by July 15. Until the new batteries are in place and have been tested, the Datto Cloud will remain on generator power to protect it in the event of a further failure of the public utility.
Datto Long-Term Response
Datto is committed to ensuring that its cloud is as secure and redundant as possible and that, in the event of a serious incident, its recovery time is as fast as possible. We plan to implement significant changes to our infrastructure to prevent future outages and to improve cloud recovery times.
First, Datto will implement further geographic redundancy for its core services. This should ensure that, in the event of a future datacenter loss, our services will continue to be available to our customers and, should they become unavailable, can be restored more quickly. We also plan to further automate our recovery processes to reduce the need for manual intervention and to further decrease recovery times.
Datto is also implementing additional controls over its primary datacenters to prevent future loss of cloud services. This will include ensuring that the colocation center at which the power loss occurred completes a comprehensive, independent third-party audit designed to ensure that its systems are up to date and that it is following proper policies and procedures. Datto also plans to introduce additional monitoring to ensure that all systems are functioning properly and to provide additional warning of potential problems.
At Datto, we strive to ensure that our services are always available to our customers. We will continue to proactively improve the Datto Cloud and to introduce further redundancy and automation to ensure that our systems are never down.
Thank you for being a Datto Partner and for working together with us to build a world-class data protection service.