February 01, 2017
GitLab Accidentally Deletes 300 GB Of Data
These days, ransomware often makes headlines when a company loses its data, but a different threat has struck GitLab: human error.
According to The Register, a sysadmin working on database replication accidentally deleted a directory on the wrong server, erasing 300GB of production data. To make matters worse, the most recent usable backup was taken six hours before the deletion. GitLab currently estimates the initial pg_basebackup sync will take around 20 hours.
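pg_basebackup copies an entire PostgreSQL data directory from the primary over a replication connection, which is why the initial sync of a database this size takes many hours. A minimal sketch of such an invocation (the host, role, and target path below are placeholders, not GitLab's actual configuration):

```shell
# Stream a full copy of the primary's data directory to this host.
# db1.example.com and the "replicator" role are illustrative only.
# -X stream copies WAL alongside the base backup so the copy is
# consistent; -P reports progress, useful when watching a ~20-hour sync.
pg_basebackup -h db1.example.com -U replicator \
  -D /var/lib/postgresql/9.6/main -X stream -P
```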
GitLab staff said they had five backup techniques deployed, but none of them allowed a full recovery. They further detailed the “problems encountered” in a Google Doc. From the document:
Logical volume manager (LVM) snapshots are by default only taken once every 24 hours. [Employee 1] happened to run one manually about 6 hours prior to the outage.
Regular backups seem to also only be taken once per 24 hours, though [Employee 1] has not yet been able to figure out where they are stored. According to [Employee 2] these don’t appear to be working, producing files only a few bytes in size.
Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost.
The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
Our backups to S3 apparently don’t work either: the bucket is empty.
We don’t have solid alerting/paging for when backups fail; we are seeing this on the dev host too now.
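The last two items in particular are preventable with very little machinery: a scheduled job that checks the newest backup's size and pages someone when it looks wrong would have caught both the few-byte dump files and the empty S3 bucket. A minimal sketch in POSIX shell (the paths and size threshold are placeholders, not GitLab's setup):

```shell
#!/bin/sh
# Sketch of a backup sanity check: verify that the newest backup file
# is plausibly sized before trusting it. Directory, threshold, and
# alert mechanism are illustrative; a real setup would page on failure.

# Prints OK/ALERT and returns non-zero when the latest backup looks
# bad, so a cron wrapper can raise an alert on failure.
check_latest_backup() {
    dir="$1"
    min_bytes="$2"

    # Newest file in the backup directory (by modification time).
    latest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "ALERT: no backups found in $dir"
        return 1
    fi

    size=$(wc -c < "$dir/$latest")
    if [ "$size" -lt "$min_bytes" ]; then
        # A few-byte file is exactly the failure GitLab describes.
        echo "ALERT: latest backup $latest is only $size bytes"
        return 1
    fi

    echo "OK: $latest is $size bytes"
}

# Example: flag anything under 1 MiB as suspect.
# check_latest_backup /var/backups/db 1048576
```

The same idea extends to remote storage: an `aws s3 ls` against the backup bucket, checked for emptiness on a schedule, would have flagged the S3 failure long before it mattered.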
This is a wake-up call not only to protect your data from accidental deletion, but also to test your backups regularly so you know they can actually be restored. If you’re looking to learn more about your network, including a risk assessment score, check out Datto’s BDR Assessment Tool!