November 29, 2021
A Peek Under the Covers of Datto SaaS Protection
The move to cloud infrastructure and cloud-hosted services has been accelerating for years, and the pandemic has only sped up that growth. Datto’s SaaS Protection service protects SaaS offerings such as Microsoft 365 and Google Workspace (formerly G Suite). Rapid growth often brings significant change and, occasionally, instability. Even for companies like Microsoft and Google that operate at hyper-scale, this accelerated growth can create issues.
Datto specializes in data protection, which means quality, reliability, and resilience are at the forefront of our engineering priorities. The past year has tested all of us in these key areas. We’re an agile organization that is continually adapting and making incremental improvements. In this blog post, I’ll highlight some of the recent challenges we’ve encountered, the actions we’ve taken, and our plans for the future to provide best-in-class service to protect SaaS workloads.
While we support both Microsoft 365 and Google Workspace, I’m going to focus on M365 specifically because it represents an increasing majority of our customer base. That said, many of the same topics apply to Google Workspace as well.
The Microsoft APIs are going through a transition from service-specific (e.g., Exchange Online, SharePoint, OneDrive) legacy APIs to the Graph API. We’ve been rapidly migrating to Graph and all of our new development is done in Graph. It’s been this way for a while now. We are aware of all service-specific API retirement dates and are staying well ahead of each and every one of those. We’re accelerating the transition on an as-needed basis since some service-specific legacy APIs have hard end of life (EOL) dates. There are two API situations that I’d like to highlight:
- APIs are changing what they return: We’re observing that Microsoft API behavior is changing on a fairly regular basis. The signature of the APIs doesn’t change, as that is a standard developer policy. However, the results returned from the API calls can (and do) change without notice. For example, a specific string value that we regularly receive may change, or we may receive different error codes from the exact same calls. I’ve seen our competitors publicly referencing these types of changes. To borrow a common phrase these days, we’re all in this together. The APIs we use are the same APIs our competitors are using (with one exception, outlined in the next bullet). This means that if we’re seeing errors or exceptions due to a Microsoft issue, there’s a high likelihood that other solutions are as well.
- Using a beta API is risky: Microsoft makes certain functionality available via beta APIs in order to get real-time feedback from its closest partners and customers. Datto, being a strategic partner to Microsoft, regularly leverages beta APIs to design and build new features for our partners and end customers as quickly as possible. As an example, we were proud to be the first to market with support for native Microsoft Teams backup last year. However, we will not sacrifice stability and reliability for speed. Microsoft’s beta APIs, as is the case with all pre-release code and functionality, come with inherent risk. Unlike the released APIs described in the previous bullet, beta APIs can break standard policy (i.e., a signature change) or go away completely with little or no notice. Because of that risk, we use beta APIs for design and development only, and we simply do not use them in production. Beware of any vendor that may not have a similar practice, as it would be an inherent risk to the ability to protect and recover your data.
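One way to tolerate responses that change without notice is to treat error-code strings as hints and fall back to the HTTP status when a code is not recognized. The following is a minimal sketch of that idea, not a description of Datto's implementation. The function name `classify_error`, the logger name, and the two code sets are assumptions made for illustration, and the code strings are example values rather than an authoritative list.

```python
import logging

log = logging.getLogger("api_drift")  # hypothetical logger name

# Illustrative values only; not an authoritative list of error codes.
RETRYABLE_CODES = {"serviceNotAvailable", "activityLimitReached"}
PERMANENT_CODES = {"itemNotFound", "accessDenied"}


def classify_error(status, body):
    """Return "retry" or "fail" for a failed API call.

    Known error-code strings are trusted first. Anything unrecognized is
    logged so the drift is visible, then classified by HTTP status alone.
    """
    error = body.get("error") if isinstance(body, dict) else None
    code = error.get("code") if isinstance(error, dict) else None

    if code in RETRYABLE_CODES:
        return "retry"
    if code in PERMANENT_CODES:
        return "fail"

    # The code string is new, missing, or the body shape has changed.
    log.warning("unrecognized error code %r (HTTP %s)", code, status)
    if status == 429 or 500 <= status < 600:
        return "retry"
    return "fail"
```

Logging every unrecognized code is the important part of the design: a string that silently changes shows up in telemetry as a spike of warnings instead of as a wave of failed backups.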
Like many cloud service providers, Microsoft implements throttling on API calls in order to ensure a high quality of service. During peak times, Microsoft will prioritize certain API calls over others. For example, a user request (e.g., fetching a message via an end-user client) will be prioritized over a third-party application request, like one from Datto SaaS Protection. When our API calls are throttled, we receive a specific error code, typically “429 Too Many Requests”. An advantage that we have as both the developer and the operator of the service, as opposed to an Independent Software Vendor (ISV) that licenses its software to a third-party provider, is that we have access to a mountain of telemetry data. We’ve made significant investments in analyzing this data with the express purpose of making our service more performant and reliable. A tangible example of this is a change that we made earlier this year to strategically schedule backups at the times of day when we see the fewest throttling errors. This has measurably increased our overall backup success rate.
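To make the scheduling idea concrete, here is a minimal sketch of how throttling telemetry could drive the choice of a backup window, along with a simple delay calculation for a throttled call. It is an illustration under stated assumptions, not our production logic: the `(hour, status)` record format, the function names, and the backoff defaults are all hypothetical. The `Retry-After` header is a standard HTTP header that throttled responses may carry.

```python
from collections import defaultdict

THROTTLED_STATUS = 429  # "Too Many Requests"


def throttle_rate_by_hour(records):
    """Return {hour: fraction of calls throttled} from (hour, status) pairs."""
    totals = defaultdict(int)
    throttled = defaultdict(int)
    for hour, status in records:
        totals[hour] += 1
        if status == THROTTLED_STATUS:
            throttled[hour] += 1
    return {hour: throttled[hour] / totals[hour] for hour in totals}


def least_throttled_hour(records):
    """Pick the hour with the lowest throttle rate; ties go to the earlier hour."""
    rates = throttle_rate_by_hour(records)
    return min(sorted(rates), key=lambda hour: rates[hour])


def retry_delay_seconds(headers, attempt, base=1.0, cap=60.0):
    """Honor Retry-After when it is a whole number of seconds.

    Otherwise fall back to capped exponential backoff (assumed defaults).
    """
    value = headers.get("Retry-After")
    if value is not None and value.isdigit():
        return float(value)
    return min(cap, base * (2 ** attempt))


# Hypothetical telemetry sample: (hour of day, HTTP status)
sample = [(2, 200), (2, 200), (2, 429), (14, 429), (14, 429), (14, 200), (3, 200)]
best_hour = least_throttled_hour(sample)
```

A real scheduler would also weigh sample size per hour, tenant time zones, and backup duration; the sketch only shows the core comparison of throttle rates.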
Many cloud services that are going through hyper-growth similar to what M365 has been experiencing, especially over the past 18 months, must make changes to keep pace with demand. That involves not only adding new features, but also adding infrastructure to support the demand. There have been two events in the past year that have had a significant impact on our service.
The first such event happened on March 15th when there was a global authentication outage. This was a very public event and impacted end users and service providers alike. For us, it caused a major influx of Datto SaaS Protection support requests which more than doubled our expected support volume for the month of March. Attending to all of the support requests took time and created a significant backlog. We’ve since made process and staffing changes to be able to handle such an event, should it happen again in the future.
At virtually the same time the authentication issue occurred, we also saw errors on one of our peering links in the US region. Microsoft offers a peering service that provides lower latency and higher reliability for traffic to and from Microsoft services such as M365 and Azure. We invest in these links when we reach a certain level of scale in a given region. The use of peering links is mutually beneficial to Datto and our partners. In the US region, we have multiple peering links to our data centers, and only one of them was exhibiting errors, which made it harder to diagnose the exact problem. The biggest challenge in identifying the root cause of the peering link errors was that we were simply getting Microsoft API errors returned to us, and they looked just like the errors we received during the authentication outage. Due to the timing, these problems blended together to create a perfect storm and prompted a surge in support tickets.
The last Microsoft event of note happened on successive days in May. Our monitoring and alerting systems quickly identified a problem because our KPIs began to drop rapidly. After investigation, we identified the root cause as a TLS negotiation failure. Microsoft was applying rolling updates to its accepted TLS versions and ciphers. We were already using TLS v1.2. However, Microsoft had restricted the TLS v1.2 ciphers it would accept to a specific, hand-selected set. Because this change was not well communicated, we opened a production down ticket with Microsoft. I can only assume other vendors did the same, because very soon thereafter Microsoft halted the rolling updates and reverted the changes (a very rare thing indeed). Unfortunately, Microsoft began rolling these updates again just 24 hours later. We were in the process of testing the cipher changes, but we had to quickly pivot to deploying the TLS changes to our fleet. This provided an opportunity to partner with Microsoft on an appropriate communication scheme to properly warn us of these changes in the future. Believe it or not, even our premium support contacts were unaware of this maintenance.
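A quick way to spot this class of problem before it hits production is to compare the ciphers a client is willing to offer with the set a server has announced it will accept. The sketch below uses Python's standard `ssl` module to do that comparison locally, without opening a connection. The helper names are made up for this post, and any server-side cipher list you pass in is your own input; nothing here reflects Microsoft's actual accepted list.

```python
import ssl


def tls12_context():
    """Build a client context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def enabled_cipher_names(ctx):
    """Return the OpenSSL names of every cipher this context will offer."""
    return {cipher["name"] for cipher in ctx.get_ciphers()}


def shared_ciphers(ctx, server_accepted):
    """Return the server-accepted ciphers that the client also offers.

    An empty result means the handshake cannot succeed with this context.
    """
    enabled = enabled_cipher_names(ctx)
    return [name for name in server_accepted if name in enabled]
```

Running `shared_ciphers(tls12_context(), announced_list)` as part of a deployment check turns a surprise negotiation failure into a test failure, provided the accepted list is communicated ahead of time.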
Our Microsoft partnership
Given how important Microsoft is to us and our customers, we are making significant investments in that relationship. There are a few things to highlight in particular:
- We have had a Microsoft premium developer support contract for the last two years. This allows for a dedicated support manager to help escalate our support tickets and also provides access to additional technical resources.
- In addition to the premium developer support contract, we purchase blocks of developer consulting hours on an as-needed basis. We’ve used these in a few different ways:
  - When we’re building new functionality, we can call on an expert in a given area who can review our design and logic to be sure that we’re using the right APIs and interpreting the data in the appropriate way. It’s a convenient way to get an expert from Microsoft to aid in the design and review of our solution.
  - For specific support cases that we feel are not getting the right attention, we can use these consulting hours to get a dedicated resource assigned. In a way, we’re paying to get more timely attention to some of our most important Microsoft issues. Sometimes, this results in Microsoft having to make a code change, which could take an extended period of time. Other times, they can suggest a workaround or an alternative solution to avoid the problem completely.
- On the Microsoft support side, we are also adding an option to our premium support that will allow our partners and customers to work a Microsoft ticket jointly with us. This will not only provide transparency on status, but also shorten the cycle time on certain tickets. We often become the person in the middle (i.e., Microsoft asks us to ask the partner or end customer something and vice versa). We think this will make a big difference in specific situations and is just another value add that we can bring.
- With Datto Continuity for Microsoft Azure now in Early Access, we are not only a partner with Microsoft, we are a significant emerging customer as well. We are collectively incentivized to make our engagement a big success. It’s truly a win-win-win (Datto, Microsoft, and our partners).
We have just scratched the surface of some of the stories and technical challenges that we have encountered. Check back for future blogs focusing on different aspects of Datto SaaS Protection.