Disaster Recovery

Preface

This guide outlines backup and HDFS disaster recovery best practices for typical Datameer implementations. Topics covered include the backup of Datameer's database and local artifacts, as well as pointers to Hadoop distribution-specific documentation on best practices.

What is Disaster Recovery?

Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions.

A full description of Disaster Recovery is available on Wikipedia.

Protecting Datameer from a Disaster

What to protect

Configuration files, JARs and custom plug-ins

Datameer configuration files contain environment-specific properties that significantly influence how the application behaves in a given implementation. To ensure a quick return to service, they should be retained in a backup.

Configuration files can be found at <Datameer Home>/conf. Back up all files and subfolders.

JARs and custom plug-ins can be found in <Datameer Home>/etc. Back up all files and subfolders.

Datameer DAP database

A full backup of the application database allows you to recover from unexpected software or hardware failures that could otherwise destroy large amounts of Datameer metadata. A backup is also a prerequisite for upgrades or for moving a Datameer installation.

Default installations of Datameer include a MySQL DAP database. The database contains all metadata information relating to Datameer artifacts. Without this database, the product can't function.
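
For example, a consistent dump of the DAP database can be taken with mysqldump. This is a minimal sketch; the database name (dap), user, and output path are placeholders to be adjusted to your installation:

    # Dump the Datameer application database (default engine: MySQL).
    # 'dap', 'datameer', and the output path are placeholder values.
    mysqldump --user=datameer --password --single-transaction \
      dap > /backup/dap-$(date +%Y%m%d).sql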

Local execution

Datameer artifacts are stored in the local file system when a job is run with the custom property das.execution-framework=Local. This is appropriate for certain usage scenarios, but when it is used, these files must be backed up as well (see the sketch after the version lists below).

Datameer 5.x:

  • Backup <Datameer Home>/das-data
  • Backup <Datameer Home>/data

Datameer 6.x:

  • Backup <Datameer Home>/data

Datameer 7.x:

  • Backup <Datameer Home>/data
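
If the provided backup script (see below) is not used, these folders, together with the configuration files and plug-ins mentioned above, can be archived with standard tools. This is a minimal sketch; the Datameer home and backup paths are placeholders, and das-data applies only to Datameer 5.x:

    # Archive configuration, plug-ins, and local execution data.
    # /opt/datameer and /backup are placeholder paths.
    DATAMEER_HOME=/opt/datameer
    tar czf /backup/datameer-files-$(date +%Y%m%d).tar.gz \
      -C "$DATAMEER_HOME" conf etc data   # add das-data on Datameer 5.x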

Datameer application backup script

Datameer Services has prepared a backup script to assist in backing up the Datameer application files. It backs up the configuration files, local execution files, and the DAP database. Within the script, you must configure the Datameer home folder location, the MySQL location and credentials, the backup folder location, HDFS backup, SCP backup, and the retention policy.

Download: das_application_backup_sh.sh
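
Once the script is configured, it can be scheduled, for example as a crontab entry; the installation path below is a placeholder:

    # Run the configured backup script nightly at 02:00 (placeholder path).
    0 2 * * * /opt/datameer/das_application_backup_sh.sh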

Disaster Recovery for Each Hadoop Distribution (Protect the Datameer HDFS Private Folder)

Each Hadoop distribution has different tools available for protecting your data and recovering from a disaster, in addition to the common DistCp and HDFS Snapshot tools provided by Apache Hadoop. Below is a summary for each distribution.

Available for all distributions

DistCp Version 2

DistCp Version 2 (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.

Apache documentation: DistCp Version 2
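
A basic invocation copies a directory tree to a second cluster; the NameNode hosts and paths below are placeholders:

    # Copy the Datameer private folder to a backup cluster.
    # nn1, nn2, and the paths are placeholder values.
    hadoop distcp hdfs://nn1:8020/datameer hdfs://nn2:8020/backup/datameer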

HDFS Snapshots

HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.

Apache documentation: HDFS Snapshots
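
For example, snapshots can be enabled and taken for the Datameer private folder from the command line; the path and snapshot name are placeholders:

    # Allow snapshots on the directory (requires HDFS superuser rights).
    hdfs dfsadmin -allowSnapshot /datameer
    # Create a named point-in-time snapshot.
    hdfs dfs -createSnapshot /datameer backup-snap
    # The snapshot is then readable under /datameer/.snapshot/backup-snap.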

Cloudera

Cloudera Manager

Cloudera Manager provides an integrated, easy-to-use management solution for enabling data protection on the Hadoop platform. Cloudera Manager enables you to replicate data across datacenters for disaster recovery scenarios. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore.

Cloudera documentation: Backup and Disaster Recovery

Cloudera Manager Snapshots

Cloudera Manager enables the creation of snapshot policies that define the directories or tables to be snapshotted, the intervals at which snapshots should be taken, and the number of snapshots that should be kept for each snapshot interval. You can also create HBase and HDFS snapshots using Cloudera Manager or by using the command line.

Cloudera documentation: Cloudera Manager Snapshot Policies
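
For HBase tables, a snapshot can also be taken from the HBase shell; the table and snapshot names below are placeholders:

    # Take a table snapshot non-interactively through the HBase shell.
    echo "snapshot 'my_table', 'my_table_backup'" | hbase shell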

Cloudera Enterprise BDR

Cloudera Enterprise Backup and Disaster Recovery (BDR) uses replication schedules to copy data from one cluster to another, enabling the second cluster to provide a backup for the first. In case of any data loss, the second cluster—the backup—can be used to restore data to production.

Cloudera tutorials: BDR Tutorials

Hortonworks

Mirroring Data with Falcon

You can mirror data between on-premises clusters or between an on-premises HDFS cluster and a cluster in the cloud using Microsoft Azure or Amazon S3. Mirroring data produces an exact copy of the data and keeps both copies synchronized. You can use Falcon to mirror HDFS directories, Hive tables, and snapshots.

Hortonworks documentation: Mirroring Data with Falcon

Replicating Data with Falcon

Falcon can replicate data across multiple clusters using DistCp. A replication feed allows you to set a retention policy and replicate data according to the frequency you specify in the feed entity. Falcon uses a pull-based replication mechanism: on each target cluster, a coordinator is scheduled for a given source cluster, and it pulls the data from the source using DistCp.

Hortonworks documentation: Replicating Data with Falcon
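
As a sketch, a replication feed defined in an XML entity file is submitted and scheduled through the Falcon CLI; the file and feed names below are placeholders:

    # Submit the feed entity definition, then schedule the feed.
    falcon entity -type feed -submit -file replication-feed.xml
    falcon entity -type feed -schedule -name datameerReplicationFeed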

Incremental backup of data using Falcon for Disaster Recovery and Burst Capacity

Apache Falcon simplifies the configuration of data motion with replication, lifecycle management, and lineage and traceability, providing data governance consistency across Hadoop components. This tutorial walks through a scenario in which email data is processed on multiple HDP clusters around the country and then backed up hourly on a cloud-hosted cluster.

Hortonworks tutorial: Incremental backup of data from HDP to Azure using Falcon for Disaster Recovery and Burst Capacity

Managing Hadoop DR with 'distcp' and 'snapshots'

Traditional 'distcp' from one directory to another or from cluster to cluster has limitations when it comes to doing updates; these limitations can lead to incorrect or incomplete updates. This document explores leveraging HDFS snapshots with distcp to eliminate this problem.

Hortonworks article: Managing Hadoop DR with 'distcp' and 'snapshots'
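
The approach pairs snapshots on the source directory with distcp's snapshot-diff support so that only the changes between two snapshots are copied; the snapshot names and cluster paths below are placeholders:

    # Copy only the changes made between snapshots s1 and s2.
    # Snapshot s1 must also exist on the target for the diff to apply cleanly.
    hadoop distcp -update -diff s1 s2 \
      hdfs://nn1:8020/datameer hdfs://nn2:8020/backup/datameer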

Disaster recovery and Backup best practices in a typical Hadoop Cluster

A disaster recovery plan, or business process contingency plan, is a set of well-defined processes and procedures that need to be executed so that the effects of a disaster are minimized and the organization is able to either maintain or quickly resume mission-critical operations.

Hortonworks articles: Series 1, Series 2

MapR

Disaster Recovery

The MapR Converged Data Platform includes backup and mirroring capabilities to protect against data loss after a site-wide disaster. MapR is the only big data platform that provides built-in, enterprise-grade DR for files, databases, and events. MapR was built to address real-world DR scenarios where lost data and downtime result in lost revenue, lost productivity, and/or failed opportunities.

MapR documentation: Disaster Recovery

MapR Snapshots

The ability to create and manage snapshots is an essential feature expected from enterprise-grade storage systems, and it is increasingly seen as critical for big data systems as well. A snapshot captures the state of the storage system at an exact point in time and is used to provide full recovery of data in the event of data loss.

MapR documentation: MapR Snapshots
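
For example, a volume snapshot can be created and listed with maprcli; the volume and snapshot names below are placeholders:

    # Create a named snapshot of a MapR volume, then list its snapshots.
    maprcli volume snapshot create -volume datameer.vol -snapshotname daily-backup
    maprcli volume snapshot list -volume datameer.vol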

Scenario: Disaster Recovery

A severe natural disaster can cripple an entire datacenter, leading to permanent data loss unless a disaster plan is in place.

Solution: Mirroring to another cluster

MapR tutorial: Mirroring to another cluster
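
Once a mirror volume has been created for the source volume, mirroring can be started from the command line; the volume name below is a placeholder:

    # Start (or re-sync) mirroring for an existing mirror volume.
    maprcli volume mirror start -name datameer.mirror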

Using Promotable Mirrors for Disaster Recovery

The concept of promoting a mirror refers to the ability to make a read-only mirror volume into a read-write volume. The main use case for this feature is to support disaster-recovery scenarios in which a read-only mirror needs to be promoted to a read-write volume so that it can become the primary volume for data storage.

MapR tutorial: Using Promotable Mirrors for Disaster Recovery
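
As a sketch, promoting a mirror changes the volume type from a read-only mirror to a standard read-write volume; the volume name below is a placeholder:

    # Promote the read-only mirror volume to a read-write volume.
    maprcli volume modify -name datameer.mirror -type rw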