Automated Multi-Region Disaster Recovery on AWS with Terraform

Published
• 7 min read

My name is Joseph Mbatchou, and I am grateful for the opportunity to introduce myself to you.

I have been in the tech industry for about 8 years, and I currently work as a Cloud Engineer at CloudSpace Consulting, LLC in Manassas, VA.

My journey began as a Computer Science Teacher and Computer Technician at Biopharcam, a computer sales company in my home country. Upon relocating to the United States, I joined Allied Universal, gradually advancing from an officer to a shift lead position. In this role, I oversaw various applications for training and operational tasks, collaborating closely with engineers and developers from Digital Realty, a data center provider.

Observing these professionals at work sparked my interest in cloud computing, leading me to pursue cloud classes, attend boot camps, and conduct in-depth research across various domains, including training at CloudSpace Academy. This exposure deepened my passion for IT and motivated me to transition into the industry to further my personal growth and contribute to organizational success.

As a Cloud Consultant, I have been privileged to collaborate with Cloud Solution Architects and DevOps teams on numerous projects, consistently exceeding customer expectations through our dedication and innovative solutions. My tenure in this role has enriched my skill set and broadened my professional experience.

Now, equipped with a wealth of knowledge in cloud computing and DevOps practices, I am eager to apply my expertise to new challenges and opportunities. I am confident in my ability to contribute effectively to your knowledge.

Thank you for considering my work. I look forward to having you on board for the learning process.


📖 Overview

Downtime is expensive, and a single-region application is a single point of failure. This project reduces that risk by deploying a Pilot Light disaster recovery architecture across two AWS regions, with no manual intervention required during a failover event. When the primary region becomes unhealthy, the system automatically:

1 - Detects the failure via CloudWatch.

2 - Triggers a Lambda function through SNS.

3 - Scales up the dormant DR region.

4 - Redirects DNS (Route 53) to the DR ALB.

5 - Scales the primary region down to zero.

6 - Sends you an email notification.

All of this happens in under 5 minutes.

Architecture

The project spans three AWS regions, each with a dedicated purpose:

| Region | Role | What lives there |
|---|---|---|
| us-west-2 | Primary | VPC, ALB, ASG (2 instances), CloudWatch Alarm, SNS Alarm Topic |
| eu-west-2 | DR (Pilot Light) | VPC, ALB, ASG (0 instances at rest) |
| ca-central-1 | Automation | Lambda, Route 53, SNS Notification Topic |

🗂️ Project Structure

The Terraform code is cleanly split across purpose-driven files:

provider.tf # Three aliased AWS providers

vpc.tf # VPCs for primary + DR via terraform-aws-modules/vpc

ec2.tf # Launch Templates, Security Groups, Auto Scaling Groups

alb.tf # Application Load Balancers + Target Groups + Listeners

cloudwatch.tf # CloudWatch Alarm + SNS alarm topic + topic policy

lambda.tf # IAM Role/Policy, Lambda function, SNS subscription

route53.tf # Route 53 hosted zone lookup + A alias record

data.tf # AMI data sources + local key-pair resolution

variables.tf # All input variables with sensible defaults

outputs.tf # ALB DNS names, Lambda name, SNS ARNs

terraform.tfvars # Your actual configuration values

Normal Traffic Flow

When everything is healthy:

User Request
└── dr.neatfleets-services.com (Route 53 A alias)
    └── Primary ALB (us-west-2)
        └── EC2 instances (private subnets, Apache httpd)

The DR region sits idle: the ASG has desired_capacity = 0, so you pay nothing for EC2 instances until you need them. The DR ALB and networking stay provisioned (and billed) at all times so that failover is fast.
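The A alias record in the flow above, and the later swap to the DR ALB, both come down to a Route 53 record change. Here is a minimal sketch of what that change could look like from Python, assuming a boto3 Route 53 client is passed in. The helper names `alias_upsert` and `point_record_at` are illustrative and not taken from the project's lambda_function.py.

```python
def alias_upsert(record_name: str, alb_dns_name: str, alb_zone_id: str) -> dict:
    """Build a Route 53 ChangeBatch that points an A alias record at an ALB."""
    return {
        "Comment": f"Point {record_name} at {alb_dns_name}",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        # Hosted zone ID of the load balancer itself,
                        # not the hosted zone of your domain.
                        "HostedZoneId": alb_zone_id,
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ],
    }


def point_record_at(route53, hosted_zone_id: str, record_name: str,
                    alb_dns_name: str, alb_zone_id: str) -> dict:
    """Apply the alias change through a Route 53 client (for example boto3)."""
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=alias_upsert(record_name, alb_dns_name, alb_zone_id),
    )
```

Keeping the change batch in a pure function makes it easy to inspect or unit test without touching AWS. The same function can serve both directions, since only the ALB DNS name and zone ID differ between primary and DR.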

The Failover Flow

Primary ALB HealthyHostCount < 1
└── CloudWatch Alarm (ALARM state)
    └── SNS Topic (us-west-2) → Lambda (ca-central-1)
        ├── Scale DR ASG: 0 → 2 instances (eu-west-2)
        ├── Wait for DR ALB healthy targets (up to 4 min)
        ├── Update Route 53: dr.neatfleets-services.com → DR ALB
        ├── Scale Primary ASG: 2 → 0 instances
        └── Send email notification via SNS
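The ordering of the Lambda steps above matters more than any single API call. The sketch below captures only that ordering, with each step passed in as a plain callable. The function name `run_failover` and the choice to send a notification when the DR targets never become healthy are assumptions for illustration, not a description of the project's actual handler.

```python
from typing import Callable


def run_failover(
    scale_dr_up: Callable[[], None],
    wait_for_dr_healthy: Callable[[], bool],
    switch_dns_to_dr: Callable[[], None],
    scale_primary_down: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run the failover steps in order. Returns True if DNS was switched."""
    scale_dr_up()

    # Gate the DNS swap on DR health; traffic stays where it is otherwise.
    if not wait_for_dr_healthy():
        # Assumption: an aborted failover is still worth an email.
        notify("Failover aborted: DR targets did not become healthy in time.")
        return False

    switch_dns_to_dr()
    scale_primary_down()
    notify("Failover complete: traffic is now served from the DR region.")
    return True
```

Because the steps are injected, the sequence can be exercised locally with stand-in functions before any real AWS client is wired in.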

Key Design Decisions

Pilot Light, Not Warm Standby

The DR ASG starts at zero desired capacity. This keeps costs low during normal operation while still enabling fast recovery: EC2 boot time on Amazon Linux 2 with httpd is typically under 90 seconds.
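Waking the pilot light amounts to a single Auto Scaling call. A minimal sketch, assuming a boto3 `autoscaling` client for the target region is passed in and that the group's maximum size already allows the requested capacity. `set_asg_capacity` and the group name `dr-secondary-asg` in the comment are hypothetical names.

```python
def set_asg_capacity(autoscaling, asg_name: str, capacity: int) -> None:
    """Set both the minimum and desired size of an Auto Scaling group."""
    if capacity < 0:
        raise ValueError("capacity must be zero or greater")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        # Assumption: the group's MaxSize is already >= capacity.
        MinSize=capacity,
        DesiredCapacity=capacity,
    )


# Hypothetical usage inside the Lambda (needs boto3 and AWS credentials):
#   import boto3
#   dr = boto3.client("autoscaling", region_name="eu-west-2")
#   set_asg_capacity(dr, "dr-secondary-asg", 2)   # wake the DR region
```

The same helper with a capacity of 0 covers the later step of scaling the primary region down.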

Three-Region Separation

The automation layer (Lambda, Route 53, notification SNS) runs in a third region (ca-central-1), isolated from both the primary failure and the DR workload. This means the failover brain is not affected by the outage it is responding to.

Health-Check-Gated DNS Swap

The Lambda does not blindly flip DNS. It polls the DR target group every 15 seconds (up to 4 minutes) and only updates Route 53 once at least 2 healthy targets are confirmed. This prevents a DNS swap to a region that hasn't finished warming up.
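The polling gate described above can be sketched roughly as follows. This assumes a boto3 `elbv2` client and a target group ARN are supplied by the caller. The function names are illustrative, and the `sleep` parameter exists only so the loop can be tested without waiting.

```python
import time


def count_healthy(elbv2, target_group_arn: str) -> int:
    """Count targets that the load balancer currently reports as healthy."""
    resp = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return sum(
        1
        for desc in resp["TargetHealthDescriptions"]
        if desc["TargetHealth"]["State"] == "healthy"
    )


def wait_for_healthy_targets(
    elbv2,
    target_group_arn: str,
    min_healthy: int = 2,
    interval: int = 15,   # seconds between polls
    timeout: int = 240,   # give up after 4 minutes
    sleep=time.sleep,
) -> bool:
    """Poll until enough targets are healthy, or until the timeout elapses."""
    waited = 0
    while True:
        if count_healthy(elbv2, target_group_arn) >= min_healthy:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

One practical consequence: the Lambda's configured timeout has to be longer than the polling window, otherwise the function is terminated before it reaches the DNS step.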

Cross-Region SNS Invocation

The CloudWatch alarm in us-west-2 publishes to an SNS topic in us-west-2. That topic has a Lambda subscription pointing at the function in ca-central-1. The SNS topic policy explicitly allows CloudWatch to publish from the primary account, preventing unauthorized invocations.
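When SNS invokes the Lambda, the CloudWatch alarm arrives as a JSON string inside the SNS record. A minimal sketch of unpacking it follows, so the function acts only on a transition to ALARM. The helper names and the alarm name used in the usage note are hypothetical. `Records`, `Sns`, `Message`, `AlarmName`, and `NewStateValue` are the standard fields of an SNS-delivered alarm notification.

```python
import json


def alarm_states(event: dict) -> list:
    """Return (alarm_name, new_state) pairs for each SNS record in the event."""
    pairs = []
    for record in event.get("Records", []):
        # The alarm payload is a JSON document serialized into Message.
        message = json.loads(record["Sns"]["Message"])
        pairs.append((message.get("AlarmName"), message.get("NewStateValue")))
    return pairs


def should_fail_over(event: dict) -> bool:
    """True only if at least one record reports a transition to ALARM."""
    return any(state == "ALARM" for _, state in alarm_states(event))
```

For example, an event whose message carries `{"AlarmName": "dr-primary-unhealthy", "NewStateValue": "ALARM"}` makes `should_fail_over` return True, while an OK notification for the same alarm is ignored.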

Prerequisites

Before you run this, make sure you have:

  • AWS CLI configured with credentials that have admin-level permissions

  • Terraform >= 1.0 installed

  • A public Route 53 hosted zone already created in your account (e.g., neatfleets-services.com)

  • EC2 key pairs already created in both us-west-2 (primary) and eu-west-2 (DR) if you want SSH access

  • The lambda_function.py file present in the project root (it is included; Terraform packages it at plan time)

⚙️ Setup & Deployment

Step 1 - Clone the Project

git clone https://github.com/Joebaho/AWS-Multi-Region-DR-TF.git

cd AWS-Multi-Region-DR-TF 

Step 2 - Configure Your Variables

Edit terraform.tfvars with your real values:

record_name = "dr.your-domain.com" 

instance_type = "t3.micro" 

primary_key_name = "your-us-west-2-keypair" 

dr_key_name = "your-eu-west-2-keypair" 

notification_email = "your email address"

Important: the zone named in hosted_zone_name must already exist as a public hosted zone in your AWS account. This project does not create the hosted zone; it only adds a record to it.

Step 3 - Initialize Terraform

terraform init 

This downloads the AWS and Archive providers and pulls the terraform-aws-modules/vpc module for both regions. Expect it to take 30 to 60 seconds.

Step 4 - Validate the Configuration

terraform fmt
terraform validate

You should see: Success! The configuration is valid.

Step 5 - Preview the Plan

terraform plan -out=tfplan

Review the output. You will see roughly 70 to 80 resources being created across the three regions. Key ones to look for:

  • module.vpc_primary and module.vpc_dr: two full VPCs

  • aws_lb.primary and aws_lb.dr: two Application Load Balancers

  • aws_autoscaling_group.primary (desired: 2) and aws_autoscaling_group.dr (desired: 0)

  • aws_lambda_function.failover: the automation brain

  • aws_route53_record.app: the DNS record

Step 6 - Apply

terraform apply "tfplan" 

This takes approximately 5 to 10 minutes due to NAT Gateway provisioning and ALB setup. Grab a coffee.

Step 7 - Confirm the Email Subscription

Check your inbox for an "AWS Notification - Subscription Confirmation" email from SNS and click Confirm Subscription. Without this, you won't receive failover notifications.

After you confirm your subscription, SNS shows a confirmation page in your browser.

Step 8 - Verify the Deployment

terraform output 

You'll see something like:

dr_alb_dns = "dr-secondary-alb-xxxx.eu-west-2.elb.amazonaws.com"
failover_lambda_name = "dr-failover-handler"
notification_topic_arn = "arn:aws:sns:ca-central-1:..."

Open your browser and visit http://dr.your-domain.com. You should see "Hello from Primary Region".

Testing the Failover

To simulate a primary region failure, set the primary ASG desired capacity to zero:

```
aws autoscaling update-auto-scaling-group \
  --region us-west-2 \
  --auto-scaling-group-name dr-primary-asg \
  --min-size 0 \
  --desired-capacity 0
```

Wait about 2 minutes for CloudWatch to fire (2 evaluation periods × 60 seconds). Then watch:

1 - The CloudWatch alarm transitions to ALARM

2 - SNS triggers Lambda

3 - Lambda scales the DR ASG up to 2

4 - Route 53 flips to the DR ALB

5 - You receive a notification email

Refresh http://dr.your-domain.com. You should now see "Hello from DR Region".
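If you would rather not sit refreshing the browser, a small script can watch the page for you. This is a rough sketch using only the Python standard library. It assumes the two greeting strings quoted in this post, and the function names are illustrative.

```python
import time
import urllib.request


def fetch_body(url: str, timeout: float = 5.0) -> str:
    """Fetch a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def serving_region(body: str) -> str:
    """Classify a response body by the greeting it contains."""
    if "Hello from DR Region" in body:
        return "dr"
    if "Hello from Primary Region" in body:
        return "primary"
    return "unknown"


def wait_for_region(url: str, expected: str, fetch=fetch_body,
                    attempts: int = 40, interval: int = 15,
                    sleep=time.sleep) -> bool:
    """Poll the URL until the expected region answers or attempts run out."""
    for attempt in range(attempts):
        try:
            if serving_region(fetch(url)) == expected:
                return True
        except OSError:
            # Connection errors and timeouts are expected mid-failover.
            pass
        if attempt < attempts - 1:
            sleep(interval)
    return False


# Hypothetical usage after triggering the test:
#   wait_for_region("http://dr.your-domain.com", "dr")
```

Local DNS caches can keep returning the old answer for a short while, so a few "primary" or failed responses after the Route 53 change are not necessarily a sign that the failover went wrong.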

Cleanup

To avoid ongoing charges, destroy everything when done:

terraform destroy -auto-approve

Cost note: The most expensive resources while running are the NAT Gateways (one per region, about $0.045/hr each) and the ALBs (about $0.008/hr each). The EC2 instances are t3.micro and free-tier eligible. Estimated cost for a full test: under $2 for a few hours. Prices vary by region, so check current AWS pricing before leaving the stack running.
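As a quick sanity check on that estimate, here is the arithmetic using only the hourly figures quoted in the cost note. It deliberately ignores EC2, data processing, and data transfer charges, so treat the result as a lower bound rather than a bill. The function name `estimate_cost` is illustrative.

```python
# Hourly rates as quoted in the cost note above (USD); actual prices vary by region.
NAT_GATEWAY_HOURLY = 0.045
ALB_HOURLY = 0.008


def estimate_cost(hours: float, nat_gateways: int = 2, albs: int = 2) -> float:
    """Estimate the fixed hourly cost of NAT Gateways and ALBs over a test run."""
    hourly = nat_gateways * NAT_GATEWAY_HOURLY + albs * ALB_HOURLY
    return hours * hourly


if __name__ == "__main__":
    for hours in (1, 3, 8):
        print(f"{hours} h: ${estimate_cost(hours):.3f}")
```

With two NAT Gateways and two ALBs the fixed cost works out to about $0.106 per hour, which is consistent with the "under $2 for a few hours" figure.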

What This Project Does NOT Cover

This is a demo for infrastructure DR, not a full production DR solution. A real production setup would also need:

  • Database replication (RDS cross-Region read replicas or Aurora Global Database)

  • S3 Cross-Region Replication for object storage

  • ACM certificates provisioned in each region, plus HTTPS listeners

  • Failback automation (returning traffic to primary after recovery)

  • Stateful session handling (sticky sessions or distributed session store)

Conclusion

This project demonstrates a clean separation of concerns across regions: primary workload, DR workload, and automation all run independently. The failover is fully hands-off, observable via CloudWatch and email, and the code is minimal enough to understand end-to-end in an afternoon. If you found this useful, feel free to fork it, adapt it for your stack, and share your improvements.

🤝 Contributing

Your perspective is valuable! Whether you see potential for improvement or appreciate what's already here, your contributions are welcomed and appreciated. Thank you for considering joining us in making this project even better. Feel free to follow me for updates on this project and others, and to explore opportunities for collaboration. Together, we can create something amazing!

📄 License

This project is licensed under the JoebahoCloud License.
