Profile Applicability:
- Level 2
Description:
Continuous Disaster Recovery (CDR) keeps up-to-date replicas of critical data and systems so that the organization can respond quickly to a disaster. Recovery sites stay continuously synchronized and failover capability remains ready, which minimizes downtime when a failure occurs. Maintaining CDR involves configuring replication, monitoring its health, and regularly testing the recovery process.
Rationale:
Continuous Disaster Recovery ensures:
• High availability and resilience by maintaining real-time or near-real-time copies of critical workloads
• Faster recovery times and minimal downtime during disaster events
• Seamless transition between primary and recovery environments with minimal disruption to services
• Improved compliance with organizational and regulatory standards for disaster recovery and data protection
Default Value:
By default, continuous disaster recovery is not configured. Manual setup and ongoing monitoring are required to maintain real-time replication and failover capabilities.
Impact:
Pros:
• Ensures rapid recovery in case of a disaster with minimal data loss
• Reduces downtime through proactive replication and monitoring
• Strengthens compliance with disaster recovery and data protection regulations
• Provides greater resilience by ensuring systems are continuously available across regions
Cons:
• Requires continuous monitoring and maintenance to ensure replication and failover processes work correctly
• Involves costs for maintaining replication infrastructure, bandwidth, and monitoring tools
• Misconfigurations or outdated backup strategies may lead to delayed recovery or data inconsistency
Pre-requisites:
IAM Permissions Required:
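drs:DescribeSourceServers, drs:DescribeJobs, drs:DescribeRecoveryInstances, drs:StartReplication, drs:StopReplication, drs:StartRecovery, drs:StartFailbackLaunch, ec2:DescribeInstances, ec2:StartInstances
These permissions allow the operator to manage replication, monitor recovery jobs, and initiate recovery or failback when necessary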
Remediation:
Test Plan:
Using AWS Console:
- Log in to the AWS Management Console
- Navigate to AWS Elastic Disaster Recovery (AWS DRS) and open the Source servers page
- Verify that every critical workload is listed and that its data replication status is Healthy
- Check that launch settings are configured for each source server so that workloads can be launched quickly in the recovery Region
- Review replication lag and the point-in-time snapshot policy against the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) required by the business
- Run a recovery drill (Initiate recovery job > Initiate drill) to confirm that recovery instances launch from current data
- Ensure CloudWatch metrics and alarms are set up to monitor replication and recovery status
Using AWS CLI:
aws drs describe-source-servers
aws drs start-replication --source-server-id <source-server-id>
aws drs start-recovery --is-drill --source-servers sourceServerID=<source-server-id>
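The replication check can also be scripted. The sketch below is one possible approach, not an official procedure: it assumes the AWS CLI is configured for the recovery Region and treats any data replication state other than CONTINUOUS as needing attention. The function name is hypothetical.

```shell
# Print source servers whose replication is not in the CONTINUOUS state.
# Assumption: anything other than CONTINUOUS (for example STALLED or
# DISCONNECTED) deserves a closer look; adjust the filter to your policy.
list_unhealthy_source_servers() {
  aws drs describe-source-servers \
    --query 'items[].[sourceServerID, dataReplicationInfo.dataReplicationState]' \
    --output text |
    awk '$2 != "CONTINUOUS" { print $1, $2 }'
}
```

Calling `list_unhealthy_source_servers` prints one line per affected server (ID followed by state) and prints nothing when all servers are replicating continuously.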
Implementation Plan:
Using AWS Console:
- Navigate to AWS Elastic Disaster Recovery and select Source servers
- Add or verify a source server for each critical workload by installing the AWS Replication Agent, and confirm that the replication settings meet the recovery goals
- Ensure that a disaster recovery plan is in place for failover to the recovery Region, including any automation that initiates recovery when the primary site fails
- Confirm that CloudWatch monitoring is in place to continuously check the health of replication and of the recovery environment
- Run a recovery drill to confirm that the disaster recovery plan functions as expected
- Ensure that alerts are configured for any replication failures or recovery issues
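As an illustration of the alerting step, the following sketch creates a CloudWatch alarm on replication lag for a single source server. It rests on assumptions that should be verified in your account: that DRS publishes a LagDuration metric (in seconds) in the AWS/DRS namespace with a SourceServerID dimension. The function name, alarm name, period, and evaluation settings are hypothetical choices.

```shell
# Create a CloudWatch alarm on replication lag for one source server.
# Assumptions (verify in the CloudWatch console before relying on this):
#   namespace AWS/DRS, metric LagDuration, dimension SourceServerID.
create_lag_alarm() {
  server_id="$1"       # source server ID (placeholder value supplied by caller)
  threshold_secs="$2"  # lag, in seconds, above which the alarm fires
  sns_topic_arn="$3"   # SNS topic that receives the notification

  aws cloudwatch put-metric-alarm \
    --alarm-name "drs-lag-${server_id}" \
    --namespace "AWS/DRS" \
    --metric-name "LagDuration" \
    --dimensions "Name=SourceServerID,Value=${server_id}" \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 3 \
    --threshold "${threshold_secs}" \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "${sns_topic_arn}"
}
```

Treating missing data as breaching is a deliberate choice here: if a server stops reporting entirely, the alarm fires instead of staying silent.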
Using AWS CLI:
Step 1: List all source servers and their replication state
aws drs describe-source-servers
Step 2: Start replication for a source server if it is not already running
aws drs start-replication --source-server-id <source-server-id>
Step 3: Launch a recovery drill to test the process
aws drs start-recovery --is-drill --source-servers sourceServerID=<source-server-id>
Step 4: Monitor the status of the drill job
aws drs describe-jobs --filters jobIDs=<job-id>
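Monitoring a drill usually means polling until the job finishes. The sketch below shows one way to do that with `aws drs describe-jobs`; the function name, default interval, and attempt limit are hypothetical, and it assumes a finished job reports the status COMPLETED.

```shell
# Poll a DRS job until it reports COMPLETED, or give up after max_attempts.
# Usage: wait_for_drs_job <job-id> [interval-seconds] [max-attempts]
wait_for_drs_job() {
  job_id="$1"
  interval_secs="${2:-30}"
  max_attempts="${3:-40}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(aws drs describe-jobs \
      --filters "jobIDs=${job_id}" \
      --query 'items[0].status' \
      --output text)"
    echo "attempt ${attempt}: ${job_id} is ${status}"
    if [ "$status" = "COMPLETED" ]; then
      return 0
    fi
    sleep "$interval_secs"
    attempt=$((attempt + 1))
  done
  echo "job ${job_id} did not complete after ${max_attempts} attempts" >&2
  return 1
}
```

A COMPLETED job only means the job itself finished. The launch result of each participating server should still be checked before the drill is considered successful.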
Backout Plan:
Using AWS Console:
- If continuous disaster recovery is misconfigured or not performing as expected, verify the replication settings of the affected source servers
- Adjust the replication settings and point-in-time snapshot policy so that the RPO and RTO targets can be met
- Reconfigure launch settings and run another recovery drill
- If a drill is unsuccessful, review the job log and troubleshoot the recovery instance to confirm that it is correctly synchronized and operational
Using AWS CLI:
To stop replication for a source server:
aws drs stop-replication --source-server-id <source-server-id>
To fail back from a recovery instance after a failover:
aws drs start-failback-launch --recovery-instance-ids <recovery-instance-id>
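Backing out of a test typically also involves removing the instances that the drill launched. The following is a minimal sketch of that cleanup, assuming drill instances can be identified by the `isDrill` flag returned by `aws drs describe-recovery-instances`; the function name is hypothetical, and the selection should be reviewed before it is run against production accounts.

```shell
# Terminate recovery instances that were launched as drills.
# Assumption: isDrill is true only for instances created by a drill.
terminate_drill_instances() {
  ids="$(aws drs describe-recovery-instances \
    --query 'items[?isDrill==`true`].recoveryInstanceID' \
    --output text)"
  if [ -z "$ids" ] || [ "$ids" = "None" ]; then
    echo "no drill recovery instances found"
    return 0
  fi
  # Word splitting on $ids is intentional: one argument per instance ID.
  # shellcheck disable=SC2086
  aws drs terminate-recovery-instances --recovery-instance-ids $ids
}
```

Running `terminate_drill_instances` leaves instances launched by a real recovery untouched, since those are not marked as drills.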
References:
- https://docs.aws.amazon.com/drs/latest/userguide/working-with-replication.html
- https://docs.aws.amazon.com/cli/latest/reference/drs/start-failover.html
- https://aws.amazon.com/disaster-recovery/