December 30, 2015
Hard drive reliability at cloud scale
As a backup and disaster recovery provider maintaining a 180+ petabyte (PB) cloud as well as a large number of on-premises backup devices, Datto relies heavily on hard drives. And while a variety of data protection methods guard against data loss in the event of a drive failure, choosing reliable hard drives is also a top priority.
In November, an in-house study showed that hard drive reliability is currently satisfactory in Datto data centers and backup devices. It also revealed some interesting data about drive reliability in general.
The study was designed to analyze how external conditions affect the survival rates of the various hard drives in use in Datto data centers and devices. Not surprisingly, the results showed that the combination of temperature and load is the biggest killer of hard drives. However, the study also showed that the impact varied among similar drives from the same manufacturer.
For example, tests of Datto’s most commonly used hard drive, a 3 TB Western Digital model, showed that conditions for drive survival differ slightly between Datto’s two largest data centers. In backup devices, however, the same model’s reliability fell between the results from those two data centers. This suggests that the primary cause of drive failure in these systems is not something that differs greatly between controlled (data center) and uncontrolled (on-premises) environments.
On the other hand, testing of a 2 TB drive from the same manufacturer, which is also used in both Datto data centers and backup devices, produced significantly different outcomes in each. This drive proved reliable in Datto cloud data center servers, but failed at a much faster rate in Datto on-site devices. This implies that this particular model is sensitive to less-than-ideal conditions, such as high temperature, humidity, dust and vibration. Conversely, testing of another 2 TB drive showed that it was less reliable in data centers than in backup appliances. So, while that drive appears resistant to environmental factors, it is less suitable for handling constant high load.
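The study's methods aren't described here, but survival comparisons like these are commonly made with a Kaplan-Meier estimate, which tracks the fraction of drives still alive over time while accounting for drives that were still running when observation ended. Here is a minimal sketch in Python. The function name and the sample observations are hypothetical and are not Datto data.

```python
from collections import Counter


def kaplan_meier(observations):
    """Estimate a survival curve from (days_observed, failed) pairs.

    failed=True means the drive died on that day; failed=False means the
    drive was still working when observation stopped (censored).
    Returns a list of (day, survival_probability), one entry per day on
    which at least one failure occurred.
    """
    failures = Counter(day for day, failed in observations if failed)
    exits = Counter(day for day, _ in observations)  # failures + censored

    at_risk = len(observations)
    survival = 1.0
    curve = []
    for day in sorted(exits):
        died = failures.get(day, 0)
        if died:
            # Multiply by the fraction of at-risk drives that survived this day.
            survival *= 1 - died / at_risk
            curve.append((day, survival))
        # Drives that failed or were censored today leave the at-risk pool.
        at_risk -= exits[day]
    return curve


# Hypothetical observations for one drive model in two environments.
data_center = [(200, True), (400, True), (400, False), (600, True), (700, False)]
on_premises = [(300, True), (500, False), (650, False), (700, False), (700, False)]

for name, obs in (("data center", data_center), ("on-premises", on_premises)):
    for day, prob in kaplan_meier(obs):
        print(f"{name}: day {day}, estimated survival {prob:.2f}")
```

Running the same function over each environment's records would give curves that can be compared model by model, which is one way to spot a drive that holds up in one setting but not the other.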
Cool it, buddy
According to the study, the highest rates of drive failure occurred in Datto data center servers. That isn’t particularly surprising, considering data center server drives are used much more heavily than backup appliance drives. As noted above, consistent high loads kill hard drives quickly. Active servers also generate a lot of heat, so heavily loaded servers are subject to high heat by default. That’s why cooling is a key attribute of good data center design. On the flip side, on-site backup devices may be exposed to high heat (and other environmental factors) externally, but don’t generate as much heat on their own.
Regardless, temperature is a major factor in drive reliability and should be considered when deploying any technology. Cooling is less of a concern for drives in on-site backup devices than for those in data center servers, but it still matters, and care should be taken to mitigate environmental conditions that contribute to drive failure. In other words, that dusty spot under a desk by the heating duct is not a good look if you want to extend the lifespan of your drives.