Part 4: Data Archiving
Most organizations need to archive data for either compliance or other legal reasons which, in turn, drive the archiving policies. The archiving policies will determine what data needs to be archived, how frequently it needs to be archived, and how long it needs to be kept. For most organizations, not all data needs to be archived and different types of data may need to follow different archival policies. The compliance and/or legal reasons for data archiving will also determine the financial and legal risks of not being able to produce the archived data when required.
Traditionally, data archiving has been performed by an organization’s backup systems. This has led to confusion between data backups and data archives. The functional difference is that backups are temporary copies of data used to quickly recover from equipment failures, user error, or other data-corrupting conditions such as viruses and ransomware. Archives are long-term copies of data which are rarely, if ever, accessed and are created for the purposes of regulatory compliance or other legal necessity such as patent/copyright disputes. Another use for archival techniques is asset management in organizations where data such as CAD drawings or advertising materials needs to be maintained.
One of the main reasons that archival data is accessed is for legal discovery. In legal discovery, the inability to produce the required data can lead to the loss of a legal case which can have financial and civil ramifications. The main difference between how archives and standard backups are created is that archives are typically written to a separate set of media which is either offsite, or taken offsite. This separate set of media is typically a write once read many (WORM) form of storage so the data on the media can’t be altered in any way after it has been written. Traditionally, archival media has been special WORM tapes, or WORM optical storage. Archival data is also typically encrypted to protect corporate data.
Knowing what data needs to be archived, how often it needs to be archived, and how long it needs to be kept determines the total amount of storage and/or media required to comply with the data archival policies. As a simple example, if 1TB of data needs to be archived monthly and kept for 7 years, it will require 1 x 12 x 7 = 84TB of total storage, assuming WORM media. This would require 10 LTO-8 WORM tapes if each tape was used to hold 9 months’ worth of archives (84 / 9, rounded up to whole tapes), or 84 LTO-8 WORM tapes if each tape only held one month’s worth of archives. Legal reasons may make it desirable to use a unique tape for each archival period (84 tapes in this example). If this is the case, roughly 89% of the capacity of each tape (8 of every 9TB) is wasted.
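The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and the 9TB usable capacity per tape is an assumption inferred from the "9 months per tape" figure in the example, not a vendor specification.

```python
import math

def archive_media_plan(tb_per_archive, archives_per_year, retention_years, tape_capacity_tb):
    """Estimate storage and tape counts for a fixed-size periodic archive."""
    archives_per_tape = int(tape_capacity_tb // tb_per_archive)
    if archives_per_tape < 1:
        raise ValueError("a single archive does not fit on one tape")

    total_archives = archives_per_year * retention_years
    return {
        "total_tb": tb_per_archive * total_archives,
        # Packed: fill each tape with as many whole archives as fit.
        "packed_tapes": math.ceil(total_archives / archives_per_tape),
        # Dedicated: one tape per archival period.
        "dedicated_tapes": total_archives,
        # Unused share of each tape when it holds a single archive.
        "wasted_fraction": 1 - tb_per_archive / tape_capacity_tb,
    }

# 1TB per month, kept for 7 years, assumed 9TB usable per tape.
plan = archive_media_plan(tb_per_archive=1, archives_per_year=12,
                          retention_years=7, tape_capacity_tb=9)
print(plan["total_tb"], "TB total")
print(plan["packed_tapes"], "tapes if packed,", plan["dedicated_tapes"], "if one per month")
print(f"{plan['wasted_fraction']:.0%} of each dedicated tape unused")
```

Changing the capacity or the archive size shows how quickly the one-tape-per-period approach multiplies media costs.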
If the data needs to be kept for extended periods of time, the life span of the media needs to be taken into consideration since media degrades over time. This may require that the data be periodically moved to new media to ensure that it is still reliably readable. It may also require the acquisition of new tape or optical drives as technologies change over time. Maintenance contracts need to be kept for these drives to ensure that they remain operational over their lifespans. There are additional costs that need to be considered if removable media such as tape or optical disk are used for archival. These include the time required to manage the media, the cost to transport the media to and from the archival location, and the cost to store the media. For legal reasons, it is often desirable to destroy the data, and in the case of WORM media to physically destroy the media, once the data is no longer required to be archived. This adds additional cost and risk. If the archived data is encrypted, there are further costs to implement and maintain a key management system for the encryption keys. There is also risk associated with damage that can occur to the media, as well as the possibility of media being misplaced.
An alternative to using WORM media for data archiving is to use cloud based storage. Cloud storage has several advantages over using removable media. These benefits allow you to focus on your core business and eliminate the concerns associated with data archival requirements.
- Extreme data durability, typically 99.999999999% (11 nines), due to multiple copies being stored on multiple independent disk systems, typically 3 copies in at least 2 geographic locations. 99.999999999% durability translates to losing 1 file every 8 years per petabyte (PB) of data.
- No costs associated with transporting the data/media offsite, excluding internet bandwidth
- No possibility of misplacing media
- No need to migrate data to new media due to media aging.
- End to end encryption and all associated costs included in the price of the cloud based storage
- Reduced management costs
- No fees associated with the destruction of data once it is no longer needed
- No wasted space since billing is based only on the actual amount of data being stored
- Reduction in the total amount of data stored due to more efficient data de-duplication and compression methods
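To put the durability figure in the first bullet into perspective, the following Python sketch converts a number of "nines" into an expected interval between file losses. It assumes that durability is quoted as an annual, per-file figure, that losses are independent, and that a petabyte is 10^15 bytes; the function names are hypothetical and the model is deliberately simplistic.

```python
def expected_years_between_losses(file_count, nines=11):
    """Average years between single-file losses for a given number of stored files."""
    annual_loss_probability = 10.0 ** -nines   # 11 nines -> 1e-11 per file per year
    expected_losses_per_year = file_count * annual_loss_probability
    return 1.0 / expected_losses_per_year

def files_for_loss_interval(years_between_losses, nines=11):
    """Number of stored files that would yield one expected loss per given interval."""
    return 1.0 / (years_between_losses * 10.0 ** -nines)

# Ten million files at 11 nines of durability.
print(expected_years_between_losses(10_000_000))

# File count and average file size implied by "1 file every 8 years per PB".
files_per_pb = files_for_loss_interval(8)
average_file_bytes = 1e15 / files_per_pb
print(files_per_pb, average_file_bytes)
```

Under these assumptions the expected loss rate depends on how many files are stored rather than on how many bytes, so a per-petabyte figure implicitly assumes an average file size (about 80KB for the figure quoted above).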
Cloud storage services are a purely operational expense and are presented as a variable monthly recurring fee which increases as more data is stored and decreases as data is deleted. Several examples of how this billing works with Amazon Glacier Deep Archive storage, which is billed at $0.00099/GB/month, or roughly $1.00/TB/month, are given at the end of this article.
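As a rough illustration of how such a variable fee accumulates, the sketch below applies the quoted rate to the earlier example of 1TB archived monthly for 7 years. It is a simplification: it counts storage charges only (request, retrieval, and data transfer fees are ignored), it assumes 1TB = 1000GB, and the function name is hypothetical.

```python
PRICE_PER_GB_MONTH = 0.00099   # rate quoted in this article
GB_PER_TB = 1000               # assumption: decimal units

def monthly_bills(tb_added_per_month, months, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return the storage-only bill for each month as the archive grows."""
    bills = []
    stored_tb = 0
    for _ in range(months):
        stored_tb += tb_added_per_month          # a new archive lands each month
        bills.append(stored_tb * GB_PER_TB * price_per_gb_month)
    return bills

bills = monthly_bills(tb_added_per_month=1, months=12 * 7)
print(f"first month:        ${bills[0]:.2f}")
print(f"final month (84TB): ${bills[-1]:.2f}")
print(f"total over 7 years: ${sum(bills):.2f}")
```

Under these assumptions the bill starts at about $0.99 per month and reaches about $83 per month once all 84TB are stored. After that point it would level off, since each new monthly archive replaces one that has aged out of the retention period.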
In addition to standard periodic data archiving, some archival policies call for archives of every version of the files being archived. This poses a challenge for traditional backup-based archiving systems, since they only create archives at fixed points in time. In order to keep archives of every version of a file, some knowledge of when the files are modified is required. To address this type of archiving, a common solution is to use a cloud-backed network attached storage (NAS) device, also known as a cloud storage gateway. These devices look like a standard NAS solution to the end user; that is, they present a file share to store data on. The difference is that these devices not only keep the data on local disk but also replicate it to cloud-based storage. They can be configured to keep all versions of the files stored on them in cloud storage, which in turn can be configured as WORM storage for compliance purposes. Additionally, many of these cloud storage gateways can be configured to automate your data retention policies. This means that they know how long data needs to be kept as an archive and can take appropriate action, such as deleting the information once it is no longer needed. An additional benefit of these devices is that they have sophisticated de-duplication and compression capabilities, which help reduce the actual amount of cloud-based storage required.
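The retention logic described above can be pictured with a short Python sketch. It does not reflect any particular gateway's API; the `ArchivedVersion` class, the function names, and the sample file names are all hypothetical, and a real product would also handle legal holds, WORM lock periods, and audit logging.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ArchivedVersion:
    path: str
    version: int
    archived_on: date

def retention_expiry(archived_on, retention_years):
    """Date on which a version is no longer required to be kept."""
    target_year = archived_on.year + retention_years
    try:
        return archived_on.replace(year=target_year)
    except ValueError:
        # Archived on Feb 29 and the target year is not a leap year.
        return archived_on.replace(year=target_year, day=28)

def split_by_retention(versions, retention_years, today):
    """Separate versions still under retention from those eligible for deletion."""
    keep, expired = [], []
    for v in versions:
        if today >= retention_expiry(v.archived_on, retention_years):
            expired.append(v)
        else:
            keep.append(v)
    return keep, expired

versions = [
    ArchivedVersion("drawings/part-a.dwg", 1, date(2015, 3, 1)),
    ArchivedVersion("drawings/part-a.dwg", 2, date(2019, 6, 15)),
    ArchivedVersion("drawings/part-a.dwg", 3, date(2020, 2, 29)),
]
keep, expired = split_by_retention(versions, retention_years=7, today=date(2024, 1, 1))
print("keep:", [v.version for v in keep])
print("eligible for deletion:", [v.version for v in expired])
```

Because every version carries its own archive date, each one ages out independently, which is the behavior an every-version archival policy requires.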
Next up, Part 5: High Availability (HA)
This is part 4 of 10 in the From High Availability to Archive: Enhancing Disaster Recovery, Backup and Archive with the Cloud series. To read them all right now, download our free eBook.