Profile Applicability:
Level 1
Description:
Amazon ElastiCache for Redis is a managed in-memory data store service. With automatic failover enabled, ElastiCache detects the failure of a primary Redis node and promotes a replica to the primary role without manual intervention. This SOP verifies that automatic failover is enabled for Redis clusters in production environments so that the service stays available during node failures.
Automatic failover is a property of the replication group (the primary node together with its replicas), not of an individual cache node, and it requires at least one replica node.
Rationale:
Enabling automatic failover ensures that Redis clusters can withstand certain failures, such as node crashes or network issues, without requiring manual intervention. Key benefits include:
High Availability: Automatic failover ensures the Redis cluster remains available even in the event of a primary node failure.
Minimized Downtime: Automatic failover reduces the downtime required to recover from failures, as the replica is automatically promoted.
Operational Efficiency: Reduces the need for manual intervention, helping maintain operational continuity and increasing system reliability.
Compliance: Helps meet the availability and fault tolerance requirements of regulatory standards like SOC 2, PCI-DSS, or HIPAA.
Impact:
Pros:
Improved Reliability: Ensures the Redis cluster remains operational, even if the primary node fails.
Reduced Operational Overhead: Eliminates the need for manual intervention during failover events, improving operational efficiency.
Compliance: Meets high availability requirements of various compliance frameworks, such as PCI-DSS and SOC 2.
Cons:
Resource Usage: Enabling automatic failover requires additional resources, such as replica nodes, which could increase costs.
Failover Interruption: During a failover, writes to the affected primary are briefly unavailable while a replica is promoted and the primary endpoint is repointed. Applications should reconnect and retry; the interruption is typically short.
Configuration Complexity: The architecture of Redis clusters with replicas and automatic failover may require careful management of instance types and scaling settings.
Default Value:
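When a Redis replication group is created through the API or CLI without specifying the setting, automatic failover is disabled. You need to configure at least one replica and enable automatic failover during or after creation. Redis (cluster mode enabled) replication groups require automatic failover to be enabled.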
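As a rough sketch of enabling the setting at creation time, the function below wraps aws elasticache create-replication-group with one primary and one replica. The function name, the node type cache.t3.micro, and the description string are illustrative assumptions, not values taken from this SOP; adjust them to your environment.

```shell
# Hypothetical helper: create a Redis replication group with automatic
# failover enabled from the start.
# The node type and description below are placeholders (assumptions).
create_rg_with_failover() {
  rg_id="$1"
  # --num-cache-clusters 2 means one primary plus one replica.
  aws elasticache create-replication-group \
    --replication-group-id "$rg_id" \
    --replication-group-description "Redis with automatic failover" \
    --engine redis \
    --cache-node-type cache.t3.micro \
    --num-cache-clusters 2 \
    --automatic-failover-enabled
}
```

Usage: create_rg_with_failover my-redis-rg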
Pre-requisite:
AWS IAM Permissions:
elasticache:DescribeReplicationGroups
elasticache:ModifyReplicationGroup
elasticache:CreateReplicationGroup (only if creating a new replication group)
AWS CLI installed and configured.
Basic understanding of ElastiCache Redis, replica nodes, and failover configurations.
Remediation:
Test Plan:
Using AWS Console:
Sign in to the AWS Management Console.
Navigate to ElastiCache under Services.
Go to Redis clusters and select the cluster you want to inspect.
Under Cluster Settings, check the Automatic Failover setting:
Ensure that Automatic Failover is enabled. The cluster should have at least one replica node for failover to work.
If Automatic Failover is disabled, enable it and apply the changes.
Using AWS CLI:
To list the Redis replication groups and check whether automatic failover is enabled, run:
aws elasticache describe-replication-groups --query 'ReplicationGroups[*].{ReplicationGroupId:ReplicationGroupId,AutomaticFailover:AutomaticFailover}'
AutomaticFailover is reported per replication group, so it appears in the output of describe-replication-groups rather than describe-cache-clusters. Compliant groups show AutomaticFailover as enabled. Example output:
[
  {
    "ReplicationGroupId": "my-redis-cluster",
    "AutomaticFailover": "enabled"
  }
]
If AutomaticFailover is disabled, update the replication group as described in the Implementation Steps. The value can also read enabling or disabling while a change is in progress.
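The check above can be turned into a pass/fail audit. The sketch below is one possible approach, not a prescribed procedure: it asks describe-replication-groups for tab-separated text output and prints every replication group whose AutomaticFailover status is not enabled. The function names filter_non_compliant and audit_automatic_failover are hypothetical.

```shell
# Reads "<replication-group-id><TAB><status>" lines on stdin and prints the
# IDs whose AutomaticFailover status is anything other than "enabled".
filter_non_compliant() {
  awk -F'\t' 'NF >= 2 && $2 != "enabled" { print $1 }'
}

# Lists non-compliant replication groups in the currently configured region.
# Returns non-zero when at least one group fails the check.
audit_automatic_failover() {
  failing="$(aws elasticache describe-replication-groups \
    --query 'ReplicationGroups[*].[ReplicationGroupId,AutomaticFailover]' \
    --output text | filter_non_compliant)"
  if [ -n "$failing" ]; then
    printf 'Automatic failover not enabled on:\n%s\n' "$failing"
    return 1
  fi
  printf 'All replication groups have automatic failover enabled.\n'
}
```

Usage: audit_automatic_failover || echo "remediation required". Groups in a transitional state (enabling, disabling) are reported as non-compliant by this sketch.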
Implementation Steps:
Using AWS Console:
Sign in to the AWS Management Console.
Navigate to ElastiCache.
In the ElastiCache Dashboard, go to Redis clusters and select the desired cluster.
Under the Cluster Settings section, locate Automatic Failover and check whether it is enabled.
Confirm that the cluster has at least one replica node. If it does not, add a replica first, because automatic failover cannot be enabled without one.
If Automatic Failover is not enabled, click Modify and enable Automatic Failover.
Save the changes to enable automatic failover.
Using AWS CLI:
To enable automatic failover for an existing Redis replication group, run the following command. The setting is a flag on modify-replication-group and takes no true/false value:
aws elasticache modify-replication-group \
  --replication-group-id <replication-group-id> \
  --automatic-failover-enabled \
  --apply-immediately
To verify that automatic failover is enabled, run:
aws elasticache describe-replication-groups --replication-group-id <replication-group-id> --query 'ReplicationGroups[*].{ReplicationGroupId:ReplicationGroupId,AutomaticFailover:AutomaticFailover}'
Ensure the AutomaticFailover field is set to enabled.
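The enable-and-verify sequence can be scripted. The following is a minimal sketch, assuming the target is a replication group with at least one replica; the function name enable_automatic_failover is hypothetical. It uses the CLI's replication-group-available waiter so that verification runs only after the modification has finished.

```shell
# Hypothetical helper: enable automatic failover on one replication group,
# wait for the modification to finish, then print the resulting status.
# Returns zero only if the final status is "enabled".
enable_automatic_failover() {
  rg_id="$1"
  aws elasticache modify-replication-group \
    --replication-group-id "$rg_id" \
    --automatic-failover-enabled \
    --apply-immediately >/dev/null || return 1

  # Built-in waiter: polls until the replication group is "available" again.
  aws elasticache wait replication-group-available \
    --replication-group-id "$rg_id" || return 1

  status="$(aws elasticache describe-replication-groups \
    --replication-group-id "$rg_id" \
    --query 'ReplicationGroups[0].AutomaticFailover' \
    --output text)"
  printf '%s\n' "$status"
  [ "$status" = "enabled" ]
}
```

Usage: enable_automatic_failover my-redis-rg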
Backout Plan:
If enabling automatic failover causes issues, such as performance degradation or unexpected behavior:
Identify the affected ElastiCache cluster.
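Disable automatic failover by running the following command. Note that Redis (cluster mode enabled) replication groups require automatic failover, so it cannot be disabled for them:
aws elasticache modify-replication-group \
  --replication-group-id <replication-group-id> \
  --no-automatic-failover-enabled \
  --apply-immediately
Verify that automatic failover is now disabled and that the cluster is functioning as expected.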
Note:
Replica Nodes: Ensure that the Redis cluster has replica nodes set up for failover to function. Without replicas, automatic failover will not be possible.
Performance Considerations: Monitor the cluster’s performance after enabling failover to ensure that the failover process does not impact application performance. Use CloudWatch metrics to monitor replication lag and other relevant parameters.
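Both notes can be checked from the CLI. The sketch below is illustrative only and its function names are hypothetical. The replica check counts MemberClusters per replication group, which is a reasonable approximation for single-shard (cluster mode disabled) groups but does not prove that every shard of a multi-shard group has a replica. The lag query reads the ReplicationLag metric (in seconds) from the AWS/ElastiCache namespace for one replica node and needs cloudwatch:GetMetricStatistics permission.

```shell
# Reads "<replication-group-id><TAB><member-count>" lines on stdin and prints
# the IDs with fewer than two member nodes (that is, no replica).
filter_without_replica() {
  awk -F'\t' 'NF >= 2 && $2 < 2 { print $1 }'
}

# Lists replication groups that appear to have no replica node.
# Assumption: single-shard groups; see the caveat in the text above.
list_groups_without_replica() {
  aws elasticache describe-replication-groups \
    --query 'ReplicationGroups[*].[ReplicationGroupId,length(MemberClusters)]' \
    --output text | filter_without_replica
}

# Prints the per-minute maximum ReplicationLag for one replica node.
# start_time and end_time are ISO 8601 UTC timestamps,
# for example 2024-01-01T00:00:00Z.
replication_lag() {
  node_id="$1"; start_time="$2"; end_time="$3"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ElastiCache \
    --metric-name ReplicationLag \
    --dimensions "Name=CacheClusterId,Value=$node_id" \
    --start-time "$start_time" --end-time "$end_time" \
    --period 60 --statistics Maximum \
    --query 'sort_by(Datapoints,&Timestamp)[*].[Timestamp,Maximum]' \
    --output text
}
```

Usage: list_groups_without_replica, then replication_lag <replica-node-id> <start-time> <end-time> for each replica you want to inspect.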