Profile Applicability:

  • Level 2

Description:

A disaster recovery failover refers to the process of switching to a secondary environment when the primary environment is compromised or unavailable. Ensuring the proper execution of a failover is essential for maintaining business continuity. This includes verifying that the failover process is correctly triggered, workloads are successfully moved to the recovery site, and the secondary environment is fully operational.

Rationale:

Ensuring the proper execution of a disaster recovery failover provides:

  • Continuity of business operations during a disaster or failure of the primary site

  • Minimized downtime and impact on users and services

  • Verified recovery in a safe environment, ensuring business-critical workloads continue running

  • Proactive and reliable response to failures with minimal manual intervention

Default Value:

Failover is not performed automatically. It must be manually initiated after confirming that the primary environment is down or otherwise unavailable for service.
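
One way to make the "confirm the primary environment is down" decision less subjective is to require several consecutive failed health checks before an operator initiates failover. The following is a minimal sketch, assuming a hypothetical list of probe results (True for healthy, newest last) and an illustrative threshold of three; neither is defined by AWS DRS.

```python
from typing import Sequence


def primary_confirmed_down(results: Sequence[bool], threshold: int = 3) -> bool:
    """Return True only when the most recent `threshold` probes all failed.

    `results` holds health-probe outcomes, oldest first (True = healthy).
    Both the input format and the default threshold are assumptions made
    for illustration.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if len(results) < threshold:
        # Not enough evidence yet to declare the primary site down.
        return False
    # Down only if none of the last `threshold` probes succeeded.
    return not any(results[-threshold:])
```

A gate like this only informs the decision; the failover itself is still initiated manually.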

Impact:

Pros:

  • Ensures business continuity and minimal downtime during a failure
  • Provides a tested and reliable failover process for workloads and services
  • Enables proactive recovery without relying on the primary site during a disaster
  • Supports compliance and resilience requirements for disaster recovery

Cons:

  • Requires careful configuration to ensure failover is correctly triggered and executed
  • May require downtime for workloads while transitioning to the recovery environment
  • Misconfigurations or errors could result in incomplete failover or data inconsistencies

Pre-requisites:

IAM Permissions Required:
 
Permissions to initiate failover, monitor the process, and access recovery resources:

  • drs:StartRecovery
  • drs:DescribeJobs
  • drs:DescribeSourceServers
  • drs:DescribeRecoveryInstances
  • drs:TerminateRecoveryInstances
  • ec2:DescribeInstances
  • ec2:StopInstances
  • ec2:StartInstances

Remediation:

Test Plan:

Using AWS Console:

  1. Log in to the AWS Management Console
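  2. Navigate to AWS Elastic Disaster Recovery (AWS DRS)
  3. Confirm that the primary environment is unavailable or experiencing issues
  4. On the Source servers page, select the affected source servers, choose Initiate recovery job, then choose Initiate recovery and select the point in time to recover from
  5. Monitor the job on the Recovery job history page to ensure that workloads are transitioned to the recovery site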
  6. After successful failover, verify that applications and data are running as expected in the recovery environment
  7. Ensure that network, storage, and security settings are properly configured in the failover environment

Using AWS CLI:

aws drs describe-source-servers
aws drs start-recovery \
  --source-servers sourceServerID=<source-server-id>
aws drs describe-jobs \
  --filters jobIDs=<recovery-job-id>
aws drs describe-recovery-instances

Implementation Plan:

Using AWS Console:

  1. Navigate to AWS Elastic Disaster Recovery (AWS DRS) and confirm the primary site is inoperable or unavailable
  2. On the Source servers page, select the source servers to recover, choose Initiate recovery job, then choose Initiate recovery
  3. Select the point in time to recover from, then review and confirm the workloads or resources to be failed over
  4. Initiate the recovery and monitor its progress on the Recovery job history page
  5. After the failover is complete, confirm that the workloads are operational in the recovery environment and accessible by end-users
  6. Test by performing a few key operations to ensure the system functions correctly in the new environment
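
The final verification step can be backed by a simple reachability check against the recovered workloads. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the hosts and ports to probe are assumptions that depend on the applications being recovered. A successful TCP connection shows only that a port is accepting connections, not that the application behind it is healthy.

```python
import socket
from typing import Dict, Iterable, Tuple


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def smoke_test(endpoints: Iterable[Tuple[str, int]]) -> Dict[str, bool]:
    """Probe each (host, port) pair and map "host:port" to its result."""
    return {f"{host}:{port}": port_is_reachable(host, port)
            for host, port in endpoints}
```

For example, calling smoke_test with the private IP address and service port of each recovery instance gives a quick pass/fail view before application-level tests are run.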

Using AWS CLI:
Step 1: List the source servers and note the IDs of the servers to recover

aws drs describe-source-servers

Step 2: Start the recovery job (failover)

aws drs start-recovery \
  --source-servers sourceServerID=<source-server-id>

Step 3: Monitor the recovery job, using the job ID returned by Step 2

aws drs describe-jobs \
  --filters jobIDs=<recovery-job-id>

Step 4: After the job completes, confirm that the recovery instances were launched

aws drs describe-recovery-instances
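
The monitoring step can be partly automated by inspecting the JSON returned when the job is queried. The following is a minimal sketch, assuming the response follows the AWS DRS DescribeJobs shape (an items list whose entries carry status and participatingServers[].launchStatus); the function name and the summary keys are illustrative and not part of any AWS tool.

```python
import json
from typing import Any, Dict


def summarize_recovery_job(response: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the first job in a DescribeJobs-style response.

    Assumes a job is finished when its status is "COMPLETED" and that a
    server was recovered when its launchStatus is "LAUNCHED".
    """
    items = response.get("items", [])
    if not items:
        raise ValueError("no job found in response")
    job = items[0]
    servers = job.get("participatingServers", [])
    # Source servers whose recovery instance did not launch (or not yet).
    not_launched = [s.get("sourceServerID") for s in servers
                    if s.get("launchStatus") != "LAUNCHED"]
    finished = job.get("status") == "COMPLETED"
    return {
        "jobID": job.get("jobID"),
        "finished": finished,
        "not_launched": not_launched,
        "succeeded": finished and bool(servers) and not not_launched,
    }


# Hypothetical sample response; the IDs are placeholders, not real resources.
sample = json.loads("""
{"items": [{"jobID": "drsjob-example", "status": "COMPLETED",
            "participatingServers": [
                {"sourceServerID": "s-example", "launchStatus": "LAUNCHED"}]}]}
""")
print(summarize_recovery_job(sample))
```

A job can reach a finished state while individual servers failed to launch, which is why the sketch checks the per-server launch status as well as the job status.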

Backout Plan:

Using AWS Console:

  1. If the failover was unsuccessful or caused issues, initiate a failback to the primary site once it is restored
  2. If the failover is incomplete or incorrect, investigate the root cause and adjust the configuration before retrying
  3. Revert network, security, and storage settings in the recovery site if required

Using AWS CLI:
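To discard the recovery instances launched by an unsuccessful failover:

aws drs terminate-recovery-instances \
  --recovery-instance-ids <recovery-instance-id>

To start failback to the primary site once it is available:

aws drs start-failback-launch \
  --recovery-instance-ids <recovery-instance-id>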

References: