Building in Reliability and Durability at Datto

By Bob Petrocelli

When I sold my second startup (a ZFS-centric storage software company) to Oracle in 2014, I sat down for coffee with the Oracle EVP who ran the systems division (the former Sun Microsystems). He shared the following opinion: “There are two technologies that are impossible to get perfectly correct: CPUs and storage.” I believe that, from a certain perspective, he was right. Why? Simply put, as you scale down the CPU process size or scale up the capacity of storage devices, the laws of quantum physics that govern how the hardware works begin to introduce errors.

In CPUs, quantum tunneling effects can cause logic gates to fail; in storage, you get uncorrectable bit errors, characterized by the UBER (uncorrectable bit error rate). For the storage devices themselves, the UBER (about 1x10^-15) has remained roughly constant even as capacities have increased by orders of magnitude. In short, data loss is a certainty given enough time. The key, of course, is to design systems that actively manage the storage devices so that the MTTDL (mean time to data loss) is greater than the useful life of the data.
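To see why a constant UBER matters more as drives grow, consider the expected number of uncorrectable errors in a single full read of a drive. This is an illustrative back-of-the-envelope sketch, not Datto's internal model; the capacities are arbitrary examples.

```python
# Illustrative sketch: expected uncorrectable errors for one full read
# of a drive, given a fixed UBER of ~1x10^-15 errors per bit read.
UBER = 1e-15

def expected_read_errors(capacity_tb: float, uber: float = UBER) -> float:
    """Expected uncorrectable errors in one complete read of the drive."""
    bits = capacity_tb * 1e12 * 8  # decimal TB -> bits, as drive vendors spec
    return bits * uber

# At a constant UBER, larger drives mean more expected errors per full read:
for tb in (2, 10, 20):
    print(f"{tb} TB drive: {expected_read_errors(tb):.3f} expected errors per full read")
```

The point of the arithmetic: a 20 TB drive is ten times more likely than a 2 TB drive to surface an uncorrectable error on any given full read, which is why active management (scrubbing, redundancy, repair) rather than raw device reliability drives MTTDL.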

At Datto, we actively manage storage so that durability, the annual probability that data is not lost (think of this as the inverse of MTTDL), is measured in 9’s (more 9’s are better). Depending on the application, we seek to achieve between 6 and 11 9’s of data durability in the cloud and about 5 9’s on the edge (the Siris device fleet). Without going too deeply into the math, we expect data durability to exceed most rational retention policies.
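Counting 9’s is just a logarithmic way of writing the annual loss probability. A minimal sketch of the conversion (the probabilities below are examples for illustration, not Datto's measured figures):

```python
import math

# Illustrative sketch: converting between annual data-loss probability
# and "nines" of durability. 6 nines means 99.9999% annual durability.
def nines(annual_loss_probability: float) -> float:
    """Nines of durability implied by an annual loss probability."""
    return -math.log10(annual_loss_probability)

def loss_probability(n: float) -> float:
    """Inverse: annual loss probability implied by n nines."""
    return 10 ** -n

print(nines(1e-6))           # a one-in-a-million annual loss chance is 6 nines
print(loss_probability(11))  # 11 nines implies a 1e-11 annual loss probability
```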

We can back-test this: we have had no known incidents of unrecoverable data loss from system failures in the cloud (about 1.6 EB) in the trailing 24 months.

But this is only half of the story. Data is useless if you can’t access it. The second consideration in our cloud architecture is availability. We sometimes refer to this internally as blast radius. We design our critical infrastructure to be free of single points of failure, but multiple failures can, and do, happen. When they do, a well-designed system will have a failure mode that limits the service interruption to a predefined level. In the case of BCDR, total failure of a cloud node would interrupt cloud access to about 0.026% of BCDR agents. Of course, this does not include the edge fleet, which provides the primary replica for typical continuity operations.
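Blast radius can be reasoned about with simple arithmetic. A hedged sketch, assuming agents are spread roughly evenly across independent cloud nodes; the node count below is back-derived from the quoted 0.026% figure purely for illustration, not a published number:

```python
# Illustrative sketch: if agents are distributed evenly across nodes,
# the blast radius of losing k nodes is k / total_nodes of the fleet.
def blast_radius_pct(total_nodes: int, failed_nodes: int = 1) -> float:
    """Percentage of agents whose cloud access a node failure interrupts."""
    return 100.0 * failed_nodes / total_nodes

# A 0.026% impact per node failure implies on the order of
# 1 / 0.00026, i.e. roughly 3,800 nodes (hypothetical count):
print(f"{blast_radius_pct(3846):.3f}%")
```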

At Datto, we have a saying: “Backup is easy, restore is hard”. Cloud operations and cloud storage are in our DNA and we are, as always, committed to continuous improvement for our partners and their clients.
