Automated Multi-Region Disaster Recovery on AWS with Terraform

My name is Joseph Mbatchou, and I am grateful for the opportunity to introduce myself to you.
I have been in the tech industry for about 8 years, and I currently work as a Cloud Engineer at CloudSpace Consulting, LLC in Manassas, VA.
My journey began as a Computer Science Teacher and Computer Technician at Biopharcam, a computer sales company in my home country. Upon relocating to the United States, I joined Allied Universal, gradually advancing from an officer to a shift lead position. In this role, I oversaw various applications for training and operational tasks, collaborating closely with engineers and developers from Digital Realty, a data center provider.
Observing these professionals at work sparked my interest in cloud computing, leading me to pursue cloud classes, attend boot camps, and conduct in-depth research across various domains, including training at CloudSpace Academy. This exposure deepened my passion for IT and motivated me to transition into the industry to further my personal growth and contribute to organizational success.
As a Cloud Consultant, I have been privileged to collaborate with Cloud Solution Architects and DevOps teams on numerous projects, consistently exceeding customer expectations through our dedication and innovative solutions. My tenure in this role has enriched my skill set and broadened my professional experience.
Now, equipped with a wealth of knowledge in cloud computing and DevOps practices, I am eager to apply my expertise to new challenges and opportunities. I am confident in my ability to contribute effectively to your learning.
Thank you for considering my work. I look forward to having you on board for the learning process.
Overview
Downtime is expensive. A single-region application is a single point of failure. This project eliminates that risk by deploying a Pilot Light disaster recovery architecture across two AWS regions, with zero manual intervention required during a failover event. When the primary region goes unhealthy, the system automatically:
1. Detects the failure via CloudWatch.
2. Triggers a Lambda function through SNS.
3. Scales up the dormant DR region.
4. Redirects DNS (Route 53) to the DR ALB.
5. Scales the primary region down to zero.
6. Sends you an email notification.
All of this happens in under 5 minutes.
Architecture
The project spans three AWS regions, each with a dedicated purpose:
| Region | Role | What lives there |
| --- | --- | --- |
| us-west-2 | Primary | VPC, ALB, ASG (2 instances), CloudWatch Alarm, SNS Alarm Topic |
| eu-west-2 | DR (Pilot Light) | VPC, ALB, ASG (0 instances at rest) |
| ca-central-1 | Automation | Lambda, Route 53, SNS Notification Topic |
Project Structure
The Terraform code is cleanly split across purpose-driven files:
provider.tf # Three aliased AWS providers
vpc.tf # VPCs for primary + DR via terraform-aws-modules/vpc
ec2.tf # Launch Templates, Security Groups, Auto Scaling Groups
alb.tf # Application Load Balancers + Target Groups + Listeners
cloudwatch.tf # CloudWatch Alarm + SNS alarm topic + topic policy
lambda.tf # IAM Role/Policy, Lambda function, SNS subscription
route53.tf # Route 53 hosted zone lookup + A alias record
data.tf # AMI data sources + local key-pair resolution
variables.tf # All input variables with sensible defaults
outputs.tf # ALB DNS names, Lambda name, SNS ARNs
terraform.tfvars # Your actual configuration values
Normal Traffic Flow
When everything is healthy:
User Request
└── dr.neatfleets-services.com (Route 53 A alias)
    └── Primary ALB (us-west-2)
        └── EC2 instances (private subnets, Apache httpd)
The DR region sits idle: the ASG has desired_capacity = 0, so you pay nothing for EC2 until you need it. The DR ALB and networking are always up so failover is fast.
The Failover Flow
Primary ALB HealthyHostCount < 1
└── CloudWatch Alarm (ALARM state)
    └── SNS Topic (us-west-2) → Lambda (ca-central-1)
        ├── Scale DR ASG: 0 → 2 instances (eu-west-2)
        ├── Wait for DR ALB healthy targets (up to 4 min)
        ├── Update Route 53: dr.neatfleets-services.com → DR ALB
        ├── Scale Primary ASG: 2 → 0 instances
        └── Send email notification via SNS
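The ordering in this flow can be sketched as a plain Python function. This is a minimal sketch and not the repository's lambda_function.py: the five callables are hypothetical stand-ins for the AWS calls the Lambda would make, injected so the step order can be checked without AWS access. Aborting when the DR targets never become healthy is an assumption; the flow above does not say what happens on timeout.

```python
from typing import Callable, List


def run_failover(
    scale_dr: Callable[[int], None],
    wait_for_dr_healthy: Callable[[], bool],
    point_dns_at_dr: Callable[[], None],
    scale_primary: Callable[[int], None],
    notify: Callable[[str], None],
    dr_capacity: int = 2,
) -> bool:
    """Run the failover steps in the order shown in the flow diagram.

    Returns True if DNS was moved to the DR region, False otherwise.
    """
    scale_dr(dr_capacity)              # DR ASG: 0 -> 2
    if not wait_for_dr_healthy():      # gate the DNS swap on healthy targets
        # Assumption: on timeout we leave DNS and the primary ASG untouched.
        notify("Failover aborted: DR targets did not become healthy in time")
        return False
    point_dns_at_dr()                  # Route 53 alias -> DR ALB
    scale_primary(0)                   # primary ASG: 2 -> 0
    notify("Failover complete: traffic is now served from the DR region")
    return True


# Usage with stand-in callables that only record the call order.
calls: List[str] = []
ok = run_failover(
    scale_dr=lambda n: calls.append(f"scale_dr:{n}"),
    wait_for_dr_healthy=lambda: True,
    point_dns_at_dr=lambda: calls.append("dns"),
    scale_primary=lambda n: calls.append(f"scale_primary:{n}"),
    notify=lambda msg: calls.append("notify"),
)
```

Keeping the AWS calls behind injected callables is one possible design choice; it makes the ordering testable locally before anything is deployed.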
Key Design Decisions
Pilot Light, Not Warm Standby
The DR ASG starts at zero desired capacity. This keeps costs low during normal operation while still enabling fast recovery: EC2 boot time on Amazon Linux 2 with httpd is typically under 90 seconds.
Three-Region Separation
The automation layer (Lambda, Route 53, notification SNS) runs in a third region (ca-central-1), isolated from both the primary failure and the DR workload. This means the failover brain is not affected by the outage it is responding to.
Health-Check-Gated DNS Swap
The Lambda does not blindly flip DNS. It polls the DR target group every 15 seconds (up to 4 minutes) and only updates Route 53 once at least 2 healthy targets are confirmed. This prevents a DNS swap to a region that hasn't finished warming up.
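The polling gate described here can be sketched as follows. This is an illustration under assumptions, not the project's actual code: `wait_for_healthy_targets` and `count_healthy` are hypothetical names, and in a real Lambda `count_healthy` would wrap a call such as the boto3 `elbv2` client's `describe_target_health` and count targets whose state is `healthy`. The sleep function is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, List


def wait_for_healthy_targets(
    count_healthy: Callable[[], int],
    required: int = 2,
    interval_s: float = 15.0,
    timeout_s: float = 240.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `required` targets are healthy or the timeout elapses."""
    waited = 0.0
    while True:
        if count_healthy() >= required:
            return True
        if waited + interval_s > timeout_s:
            return False          # another sleep would exceed the budget
        sleep(interval_s)
        waited += interval_s


# Usage: simulate targets becoming healthy on the third poll, without sleeping.
readings = iter([0, 1, 2])
slept: List[float] = []
ready = wait_for_healthy_targets(lambda: next(readings), sleep=slept.append)
```

With the defaults, the function checks immediately and then every 15 seconds until the 4 minute mark, returning `False` if the threshold is never reached.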
Cross-Region SNS Invocation
The CloudWatch alarm in us-west-2 publishes to an SNS topic in us-west-2. That topic has a Lambda subscription pointing at the function in ca-central-1. The SNS topic policy explicitly allows CloudWatch to publish from the primary account, preventing unauthorized invocations.
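On the receiving side, the Lambda gets the alarm wrapped in an SNS event, with the CloudWatch payload as a JSON string in the `Message` field. The sketch below shows one way to pick out alarms that have entered the ALARM state. The function name and the sample alarm name are hypothetical, and the sample event is trimmed to the two fields the code reads.

```python
import json
from typing import Any, Dict, List


def alarms_in_alarm_state(event: Dict[str, Any]) -> List[str]:
    """Return the names of alarms in an SNS event whose new state is ALARM."""
    names: List[str] = []
    for record in event.get("Records", []):
        # CloudWatch delivers the alarm details as a JSON string inside SNS.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            names.append(message.get("AlarmName", "<unnamed>"))
    return names


# Usage with a trimmed-down event; the alarm name is a placeholder.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {"AlarmName": "primary-unhealthy-hosts", "NewStateValue": "ALARM"}
                )
            }
        }
    ]
}
triggered = alarms_in_alarm_state(sample_event)
```

Filtering on `NewStateValue` means a later OK notification from the same alarm would not start a second failover.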
Prerequisites
Before you run this, make sure you have:
AWS CLI configured with credentials that have admin-level permissions
Terraform >= 1.0 installed
A public Route 53 hosted zone already created in your account (e.g., neatfleets-services.com)
EC2 key pairs already created in both us-west-2 (primary) and eu-west-2 (DR) if you want SSH access
The lambda_function.py file present in the project root (it is included; Terraform packages it at plan time)
Setup & Deployment
Step 1: Clone the Project
git clone https://github.com/Joebaho/AWS-Multi-Region-DR-TF.git
cd AWS-Multi-Region-DR-TF
Step 2: Configure Your Variables
Edit terraform.tfvars with your real values:
hosted_zone_name = "your-domain.com"
record_name = "dr.your-domain.com"
instance_type = "t3.micro"
primary_key_name = "your-us-west-2-keypair"
dr_key_name = "your-eu-west-2-keypair"
notification_email = "you@example.com"
Important: hosted_zone_name must already exist as a public hosted zone in your AWS account. This project does not create the hosted zone; it only adds a record to it.
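Because the record is added to an existing zone, record_name has to be the zone apex or a name underneath hosted_zone_name. A small pre-flight check along these lines can catch a mismatch before running Terraform. The helper below is a hypothetical sketch and is not part of the repository.

```python
def record_in_zone(record_name: str, hosted_zone_name: str) -> bool:
    """True if record_name is the zone apex or a subdomain of the zone."""
    # DNS names are case-insensitive and may carry a trailing dot.
    record = record_name.rstrip(".").lower()
    zone = hosted_zone_name.rstrip(".").lower()
    return record == zone or record.endswith("." + zone)


# Usage with the example names from this article.
fits = record_in_zone("dr.neatfleets-services.com", "neatfleets-services.com")
```

The check compares on a label boundary (the leading dot), so a look-alike such as `evilneatfleets-services.com` is not treated as part of the zone.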
Step 3: Initialize Terraform
terraform init
This downloads the AWS and Archive providers and pulls the terraform-aws-modules/vpc module for both regions. Expect it to take 30 to 60 seconds.
Step 4: Validate the Configuration
terraform fmt
terraform validate
You should see: Success! The configuration is valid.
Step 5: Preview the Plan
terraform plan -out=tfplan
Review the output. You will see roughly 70 to 80 resources being created across the three regions. Key ones to look for:
module.vpc_primary and module.vpc_dr: two full VPCs
aws_lb.primary and aws_lb.dr: two Application Load Balancers
aws_autoscaling_group.primary (desired: 2) and aws_autoscaling_group.dr (desired: 0)
aws_lambda_function.failover: the automation brain
aws_route53_record.app: the DNS record
Step 6: Apply
terraform apply "tfplan"
This takes approximately 5 to 10 minutes due to NAT Gateway provisioning and ALB setup. Grab a coffee.
Step 7: Confirm the Email Subscription
Check your inbox for an "AWS Notification - Subscription Confirmation" email from SNS and click Confirm Subscription. Without this, you won't receive failover notifications.
After you confirm your subscription, you will land on an SNS confirmation page.
Step 8: Verify the Deployment
terraform output
You'll see something like:
dr_alb_dns = "dr-secondary-alb-xxxx.eu-west-2.elb.amazonaws.com"
failover_lambda_name = "dr-failover-handler"
notification_topic_arn = "arn:aws:sns:ca-central-1:..."
Open your browser and visit http://dr.your-domain.com. You should see "Hello from Primary Region".
Testing the Failover
To simulate a primary region failure, set the primary ASG desired capacity to zero:
```
aws autoscaling update-auto-scaling-group \
  --region us-west-2 \
  --auto-scaling-group-name dr-primary-asg \
  --min-size 0 \
  --desired-capacity 0
```
Wait about 2 minutes for CloudWatch to fire (2 evaluation periods × 60 seconds). Then watch:
1. The CloudWatch alarm transitions to ALARM.
2. SNS triggers Lambda.
3. Lambda scales the DR ASG up to 2.
4. Route 53 flips to the DR ALB.
5. You receive a notification email.
Refresh http://dr.your-domain.com. You should now see "Hello from DR Region".
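If you prefer to check from a script instead of a browser, a sketch like the one below reports which region answered, based on the two greeting strings quoted in this article. The function names are hypothetical, and the URL you pass in is your own record name.

```python
import urllib.request


def serving_region(body: str) -> str:
    """Classify a response body using the greetings served by each region."""
    if "Hello from DR Region" in body:
        return "dr"
    if "Hello from Primary Region" in body:
        return "primary"
    return "unknown"


def fetch_serving_region(url: str, timeout_s: float = 5.0) -> str:
    """Fetch the page over HTTP and report which region answered."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return serving_region(resp.read().decode("utf-8", errors="replace"))


# Usage on a canned body; call fetch_serving_region("http://dr.your-domain.com")
# against a live deployment instead.
after_failover = serving_region("<h1>Hello from DR Region</h1>")
```

Local DNS caches may keep returning the old answer for a short while after the swap, so a result of `primary` immediately after the notification email is not necessarily a failure.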
Cleanup
To avoid ongoing charges, destroy everything when done:
terraform destroy -auto-approve
Cost note: The most expensive resources while running are the NAT Gateways (one per region, roughly $0.045/hr each) and the ALBs (roughly $0.008/hr each). The EC2 instances are t3.micro and free-tier eligible. Estimated cost for a full test: under $2 for a few hours.
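The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the hourly figures quoted in the cost note and assumes two NAT Gateways and two ALBs (one of each in the primary and DR regions). It deliberately ignores data transfer, data processing, and EC2 charges, and actual prices vary by region.

```python
NAT_GATEWAY_HOURLY_USD = 0.045  # per NAT Gateway, figure quoted in the cost note
ALB_HOURLY_USD = 0.008          # per ALB, figure quoted in the cost note


def estimated_cost_usd(hours: float, nat_gateways: int = 2, albs: int = 2) -> float:
    """Rough hourly-charge estimate for the always-on resources."""
    hourly = nat_gateways * NAT_GATEWAY_HOURLY_USD + albs * ALB_HOURLY_USD
    return round(hourly * hours, 3)


# Usage: a four-hour test session.
four_hour_test = estimated_cost_usd(4)
```

For a four-hour session this comes to about $0.42 under these assumptions, which is consistent with the "under $2" estimate.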
What This Project Does NOT Cover
This is a demo for infrastructure DR, not a full production DR solution. A real production setup would also need:
Database replication (RDS cross-Region read replicas or Aurora Global Database)
S3 Cross-Region Replication for object storage
ACM Certificate replication and HTTPS listeners
Failback automation (returning traffic to primary after recovery)
Stateful session handling (sticky sessions or distributed session store)
Conclusion
This project demonstrates a clean separation of concerns across regions: primary workload, DR workload, and automation all running independently. The failover is fully hands-off, observable via CloudWatch and email, and the code is minimal enough to understand end-to-end in an afternoon. If you found this useful, feel free to fork it, adapt it for your stack, and share your improvements.
Contributing
Your perspective is valuable! Whether you see potential for improvement or appreciate what's already here, your contributions are welcomed and appreciated. Thank you for considering joining us in making this project even better. Feel free to follow me for updates on this project and others, and to explore opportunities for collaboration. Together, we can create something amazing!
License
This project is licensed under the JoebahoCloud License.



