This guide outlines Datameer backup and HDFS disaster recovery best practices for typical Datameer implementations. It covers backing up Datameer's database and local artifacts, and points to the disaster recovery documentation for each Hadoop distribution.
What is Disaster Recovery?
Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions.
A full description of Disaster Recovery is available on Wikipedia.
Protecting Datameer from a Disaster
What to protect
Configuration files, JARs and custom plug-ins
Datameer configuration files contain environment-specific properties that significantly influence how the application behaves in a given implementation. To ensure a quick return to service, retain them in a backup.
Configuration files can be found at <Datameer Home>/conf. Back up all files and subfolders.
JARs and custom plug-ins can be found in <Datameer Home>/etc. Back up all files and subfolders.
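As an illustration of the step above, the following is a minimal sketch that archives the conf and etc folders into a single timestamped file. It is not the Datameer backup script; the function name, archive naming scheme, and backup location are assumptions made for this example.

```python
import tarfile
import time
from pathlib import Path


def archive_datameer_dirs(datameer_home, backup_dir, subdirs=("conf", "etc")):
    """Write a gzip-compressed tar archive of the given Datameer subfolders.

    Returns the path of the archive that was created.
    """
    home = Path(datameer_home)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: one archive per run, named by timestamp.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = target_dir / f"datameer-files-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as tar:
        for name in subdirs:
            source = home / name
            if not source.is_dir():
                raise FileNotFoundError(f"missing folder: {source}")
            # arcname keeps paths relative to the Datameer home folder,
            # and tar.add() includes all files and subfolders recursively.
            tar.add(source, arcname=name)
    return archive_path
```

Other folders under the Datameer home can be included by passing a different `subdirs` tuple, for example `archive_datameer_dirs("/opt/datameer", "/backup/datameer", subdirs=("conf", "etc", "das-data"))`, where both paths are placeholders.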
Datameer DAP database
A full backup of the application database allows you to recover from unexpected software or hardware failures that could otherwise cause the loss of large amounts of Datameer metadata. It is also a prerequisite for upgrading or moving the Datameer installation.
Default installations of Datameer include a MySQL DAP database. The database contains all metadata information relating to Datameer artifacts. Without this database, the product can't function.
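A logical dump with the standard `mysqldump` client is one common way to take such a backup. The sketch below only builds and runs the command; the database name `dap`, the user name, and the output path in the usage are placeholders to replace with your installation's values, and the password is assumed to come from a MySQL option file (for example `~/.my.cnf`) rather than the command line.

```python
import subprocess
from pathlib import Path


def build_mysqldump_command(database, user, host="localhost", port=3306):
    """Build a mysqldump argument list for one logical backup."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        "--routines",            # include stored procedures and functions
        database,
    ]


def dump_database(database, user, output_file, **kwargs):
    """Run mysqldump and write the SQL dump to output_file."""
    command = build_mysqldump_command(database, user, **kwargs)
    with open(output_file, "wb") as out:
        # check=True raises CalledProcessError if mysqldump fails.
        subprocess.run(command, stdout=out, check=True)
    return Path(output_file)


if __name__ == "__main__":
    # Placeholder values; nothing is executed here, the command is only shown.
    print(" ".join(build_mysqldump_command("dap", "backup_user")))
```

Calling `dump_database("dap", "backup_user", "/backup/datameer/dap.sql")` would then write the dump, assuming `mysqldump` is on the PATH and the credentials are configured.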
Datameer artifacts are stored in the local filesystem when a job is run with the custom property das.execution-framework=Local. This is appropriate for certain usage scenarios, but the resulting local files then need to be backed up as well:
- Back up <Datameer Home>/das-data
- Back up <Datameer Home>/data
Datameer application backup script
Datameer services has prepared a backup script to help back up the Datameer application files. It backs up the configuration files, local execution files, and the DAP database. Before use, configure the script with the Datameer home folder location, the MySQL location and credentials, the backup folder location, the HDFS backup and SCP backup settings, and the retention policy.
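To illustrate what a retention policy typically does, here is a minimal sketch that selects backup archives older than a given number of days. It is not taken from the Datameer script; the `*.tar.gz` naming and the use of file modification time are assumptions for this example.

```python
from datetime import datetime, timedelta
from pathlib import Path


def select_expired_backups(backup_dir, retention_days, now=None):
    """Return backup archives older than the retention window, oldest first."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for path in Path(backup_dir).glob("*.tar.gz"):
        # The file's modification time stands in for the backup date.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified < cutoff:
            expired.append(path)
    return sorted(expired, key=lambda p: p.stat().st_mtime)
```

The function only reports candidates. A caller could review the list and then remove each entry with `path.unlink()`, which keeps the destructive step separate and easy to audit.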
Disaster Recovery for Each Hadoop Distribution (Protect the Datameer HDFS Private Folder)
Each Hadoop distribution has different tools available for protecting your data and recovering from a disaster in addition to the common DistCp and HDFS Snapshot tools built by Apache. Below is a summary for each distribution.
Available for all distributions
DistCp Version 2
DistCp Version 2 (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
Apache documentation: DistCp Version 2
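As a rough sketch of how a DistCp copy of the Datameer private folder might be assembled, the helper below builds a `hadoop distcp` command line. The NameNode addresses and folder paths in the usage are hypothetical; `-update`, `-p` and `-m` are standard DistCp options, but check the Apache documentation for the behavior in your Hadoop version.

```python
def build_distcp_command(source_uri, target_uri, max_maps=None):
    """Build a 'hadoop distcp' argument list for an incremental copy."""
    command = [
        "hadoop", "distcp",
        "-update",  # copy only files that are missing or differ at the target
        "-pugp",    # preserve user, group, and permission
    ]
    if max_maps is not None:
        # Upper bound on the number of simultaneous map tasks.
        command += ["-m", str(max_maps)]
    command += [source_uri, target_uri]
    return command


if __name__ == "__main__":
    # Hypothetical clusters and paths, shown for illustration only.
    print(" ".join(build_distcp_command(
        "hdfs://prod-nn:8020/user/datameer",
        "hdfs://dr-nn:8020/backup/datameer",
        max_maps=20,
    )))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a host with the Hadoop client installed.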
HDFS Snapshots
HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.
Apache documentation: HDFS Snapshots
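The following sketch shows the two standard commands involved in snapshotting a directory: an administrator first marks it snapshottable, then a snapshot is created. The directory and snapshot name in the usage are hypothetical placeholders for the Datameer private folder.

```python
def build_allow_snapshot_command(directory):
    """Admin step: mark a directory as snapshottable."""
    return ["hdfs", "dfsadmin", "-allowSnapshot", directory]


def build_create_snapshot_command(directory, name=None):
    """Create a snapshot; HDFS generates a default name if none is given."""
    command = ["hdfs", "dfs", "-createSnapshot", directory]
    if name:
        command.append(name)
    return command


def snapshot_path(directory, name):
    """Path under which the read-only snapshot contents are exposed."""
    return f"{directory.rstrip('/')}/.snapshot/{name}"


if __name__ == "__main__":
    # Hypothetical directory and snapshot name.
    print(" ".join(build_allow_snapshot_command("/user/datameer")))
    print(" ".join(build_create_snapshot_command("/user/datameer", "daily-backup")))
    print(snapshot_path("/user/datameer", "daily-backup"))
```

Files can be restored from a snapshot by copying them out of the `.snapshot` path with ordinary HDFS copy commands.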
Cloudera
Cloudera Manager provides an integrated, easy-to-use management solution for enabling data protection on the Hadoop platform. Cloudera Manager enables you to replicate data across datacenters for disaster recovery scenarios. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore.
Cloudera documentation: Backup and Disaster Recovery
Cloudera Manager enables the creation of snapshot policies that define the directories or tables to be snapshotted, the intervals at which snapshots should be taken, and the number of snapshots that should be kept for each snapshot interval. You can also create HBase and HDFS snapshots using Cloudera Manager or by using the command line.
Cloudera documentation: Cloudera Manager Snapshot Policies
Cloudera Enterprise Backup and Disaster Recovery (BDR) uses replication schedules to copy data from one cluster to another, enabling the second cluster to provide a backup for the first. In case of any data loss, the second cluster (the backup) can be used to restore data to production.
Cloudera tutorials: BDR Tutorials
Hortonworks
Mirroring Data with Falcon
You can mirror data between on-premises clusters or between an on-premises HDFS cluster and a cluster in the cloud using Microsoft Azure or Amazon S3. Mirroring data produces an exact copy of the data and keeps both copies synchronized. You can use Falcon to mirror HDFS directories, Hive tables, and snapshots.
Hortonworks documentation: Mirroring Data with Falcon
Falcon can replicate data across multiple clusters using DistCp. A replication feed lets you set a retention policy and replicate data at the frequency you specify in the feed entity. Falcon uses a pull-based replication mechanism: in every target cluster, for a given source cluster, a coordinator is scheduled that pulls the data from the source cluster using DistCp.
Hortonworks documentation: Replicating Data with Falcon
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components. This tutorial walks through a scenario where email data gets processed on multiple HDP clusters around the country then gets backed up hourly on a cloud hosted cluster.
Traditional 'distcp' from one directory to another or from cluster to cluster has limitations when it comes to doing updates. These limitations can lead to incorrect updates or incomplete updates. This document explores leveraging HDFS snapshots with distcp to eliminate this problem.
Hortonworks article: Managing Hadoop DR with 'distcp' and 'snapshots'
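As a rough illustration of the snapshot-based approach, DistCp accepts a `-diff` option (used together with `-update`) that copies only the changes between two source snapshots. The sketch below builds such a command; the paths and snapshot names are hypothetical. Per the Apache DistCp documentation, this assumes both directories are snapshottable and that the target still matches the earlier snapshot, with no changes made on the target since the last sync.

```python
def build_snapshot_diff_distcp(source_dir, target_dir, from_snapshot, to_snapshot):
    """Build a distcp command that applies the changes between two snapshots."""
    return [
        "hadoop", "distcp",
        "-update",                             # required when using -diff
        "-diff", from_snapshot, to_snapshot,   # changes from the older to the newer snapshot
        source_dir, target_dir,
    ]


if __name__ == "__main__":
    # Hypothetical clusters, paths, and snapshot names.
    print(" ".join(build_snapshot_diff_distcp(
        "hdfs://prod-nn:8020/user/datameer",
        "hdfs://dr-nn:8020/user/datameer",
        "sync-1",
        "sync-2",
    )))
```

After a successful run, a snapshot with the newer name is usually created on the target as well, so that the next incremental copy has a matching starting point.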
A disaster recovery plan, or business process contingency plan, is a set of well-defined processes or procedures that need to be executed so that the effects of a disaster are minimized and the organization can either maintain or quickly resume mission-critical operations.
MapR
The MapR Converged Data Platform includes backup and mirroring capabilities to protect against data loss after a site-wide disaster. MapR is the only big data platform that provides built-in, enterprise-grade DR for files, databases, and events. MapR was built to address real-world DR scenarios where lost data and downtime result in lost revenue, lost productivity, and/or failed opportunities.
MapR documentation: Disaster Recovery
The ability to create and manage snapshots is an essential feature expected from enterprise-grade storage systems. This capability is increasingly seen as critical with big data systems as well. A snapshot captures the state of the storage system at an exact point in time and can be used to fully recover data in the event of data loss.
MapR documentation: MapR Snapshots
A severe natural disaster can cripple an entire datacenter, leading to permanent data loss unless a disaster plan is in place.
Solution: Mirroring to another cluster
MapR tutorial: Mirroring to another cluster
The concept of promoting a mirror refers to the ability to make a read-only mirror volume into a read-write volume. The main use case for this feature is to support disaster-recovery scenarios in which a read-only mirror needs to be promoted to a read-write volume so that it can become the primary volume for data storage.
MapR tutorial: Using Promotable Mirrors for Disaster Recovery