Automated Multi-Region Disaster Recovery on AWS with Terraform

📖 Overview

Downtime is expensive. A single-region application is a single point of failure. This project reduces that risk by deploying a Pilot Light disaster recovery architecture across two AWS regions, with no manual intervention required during a failover event. When the primary region becomes unhealthy, the system automatically:

1. Detects the failure via CloudWatch.

2. Triggers a Lambda function through SNS.

3. Scales up the dormant DR region.

4. Redirects DNS (Route 53) to the DR ALB.

5. Scales the primary region down to zero.

6. Sends you an email notification.

In a typical run, all of this completes in under 5 minutes.

Architecture

The project spans three AWS regions, each with a dedicated purpose:

| Region | Role | What lives there |
|---|---|---|
| us-west-2 | Primary | VPC, ALB, ASG (2 instances), CloudWatch Alarm, SNS Alarm Topic |
| eu-west-2 | DR (Pilot Light) | VPC, ALB, ASG (0 instances at rest) |
| ca-central-1 | Automation | Lambda, Route 53, SNS Notification Topic |

🗂️ Project Structure

The Terraform code is cleanly split across purpose-driven files:

provider.tf # Three aliased AWS providers

vpc.tf # VPCs for primary + DR via terraform-aws-modules/vpc

ec2.tf # Launch Templates, Security Groups, Auto Scaling Groups

alb.tf # Application Load Balancers + Target Groups + Listeners

cloudwatch.tf # CloudWatch Alarm + SNS alarm topic + topic policy

lambda.tf # IAM Role/Policy, Lambda function, SNS subscription

route53.tf # Route 53 hosted zone lookup + A alias record

data.tf # AMI data sources + local key-pair resolution

variables.tf # All input variables with sensible defaults

outputs.tf # ALB DNS names, Lambda name, SNS ARNs

terraform.tfvars # Your actual configuration values

Normal Traffic Flow

When everything is healthy:

User Request
└── dr.neatfleets-services.com (Route 53 A alias)
    └── Primary ALB (us-west-2)
        └── EC2 instances (private subnets, Apache httpd)

The DR region sits idle — the ASG has desired_capacity = 0, so you pay nothing for EC2 until you need it. The DR ALB and networking are always up so failover is fast.
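The A alias record in the flow above can be expressed as a Route 53 change batch. The sketch below is only an illustration of that shape, not the project's actual code: the function name `build_alias_change_batch` and the placeholder zone ID are hypothetical. The same structure, filled with the DR ALB's DNS name and zone ID, is what a failover would submit.

```python
from typing import Any, Dict


def build_alias_change_batch(
    record_name: str,
    alb_dns_name: str,
    alb_hosted_zone_id: str,
    comment: str = "",
) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch that points an A alias record at an ALB.

    UPSERT creates the record if it is missing and overwrites it otherwise,
    so the same shape works for the initial record and for a later repoint.
    """
    batch: Dict[str, Any] = {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    # Alias records carry no TTL or ResourceRecords.
                    "AliasTarget": {
                        # Zone ID of the load balancer, not of your domain.
                        "HostedZoneId": alb_hosted_zone_id,
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ]
    }
    if comment:
        batch["Comment"] = comment
    return batch


# Placeholder values for illustration only.
batch = build_alias_change_batch(
    "dr.your-domain.com",
    "dr-secondary-alb-xxxx.eu-west-2.elb.amazonaws.com",
    "ZEXAMPLEALBZONE",
)
```

With boto3, a dict like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of your own domain's zone.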

The Failover Flow

Primary ALB HealthyHostCount < 1
└── CloudWatch Alarm (ALARM state)
    └── SNS Topic (us-west-2) → Lambda (ca-central-1)
        ├── Scale DR ASG: 0 → 2 instances (eu-west-2)
        ├── Wait for DR ALB healthy targets (up to 4 min)
        ├── Update Route 53: dr.neatfleets-services.com → DR ALB
        ├── Scale Primary ASG: 2 → 0 instances
        └── Send email notification via SNS
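The ordering of the Lambda's steps matters: DNS should only move once the DR side is ready, and the primary should only be scaled down after DNS has moved. A minimal sketch of that sequencing follows, with each step injected as a callable so the order is easy to see. The function name `run_failover` is hypothetical, and the abort-on-timeout branch is an assumption; the article does not say what the real function does if the DR targets never become healthy.

```python
from typing import Callable


def run_failover(
    scale_dr_up: Callable[[], None],
    wait_for_dr_healthy: Callable[[], bool],
    point_dns_at_dr: Callable[[], None],
    scale_primary_down: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run the failover steps in order; return True if DNS was switched."""
    scale_dr_up()
    if not wait_for_dr_healthy():
        # Assumed behavior: leave DNS and the primary ASG untouched
        # when the DR region does not come up within the time budget.
        notify("Failover aborted: DR targets did not become healthy in time.")
        return False
    point_dns_at_dr()
    scale_primary_down()
    notify("Failover complete: traffic is now served from the DR region.")
    return True


# Dry run with stand-in steps that only record what was called.
calls = []
switched = run_failover(
    scale_dr_up=lambda: calls.append("scale_dr_up"),
    wait_for_dr_healthy=lambda: True,
    point_dns_at_dr=lambda: calls.append("point_dns_at_dr"),
    scale_primary_down=lambda: calls.append("scale_primary_down"),
    notify=lambda message: calls.append("notify"),
)
```

Injecting the steps keeps the sequencing testable without touching AWS; in a real handler each callable would wrap the corresponding boto3 call.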

Key Design Decisions

Pilot Light, Not Warm Standby

The DR ASG starts at zero desired capacity. This keeps costs low during normal operation while still enabling fast recovery: EC2 boot time on Amazon Linux 2 with httpd is typically under 90 seconds.
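Moving an ASG between zero and two instances is a single Auto Scaling API call, but the desired capacity has to stay within the group's minimum and maximum sizes. The sketch below builds keyword arguments suitable for boto3's `update_auto_scaling_group`, widening the bounds only when needed. The helper name `capacity_update_kwargs`, the DR group name `dr-secondary-asg`, and the current min/max values are assumptions for illustration.

```python
from typing import Any, Dict


def capacity_update_kwargs(
    group_name: str,
    desired: int,
    current_min: int,
    current_max: int,
) -> Dict[str, Any]:
    """Build arguments that set desired capacity without violating bounds.

    Auto Scaling rejects a desired capacity outside [MinSize, MaxSize],
    so the bounds are widened just enough to include the new value.
    """
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": min(current_min, desired),
        "MaxSize": max(current_max, desired),
        "DesiredCapacity": desired,
    }


# Wake the pilot light: DR goes from 0 to 2 instances (hypothetical name).
dr_up = capacity_update_kwargs("dr-secondary-asg", 2, current_min=0, current_max=2)

# Drain the primary: the minimum must drop to 0 along with the desired count.
primary_down = capacity_update_kwargs("dr-primary-asg", 0, current_min=2, current_max=4)
```

Each dict could then be unpacked into the call on an Auto Scaling client created for the matching region, for example `client.update_auto_scaling_group(**dr_up)`.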

Three-Region Separation

The automation layer (Lambda, Route 53, notification SNS) runs in a third region (ca-central-1), isolated from both the primary failure and the DR workload. This means the failover brain is not affected by the outage it is responding to.

Health-Check-Gated DNS Swap

The Lambda does not blindly flip DNS. It polls the DR target group every 15 seconds (up to 4 minutes) and only updates Route 53 once at least 2 healthy targets are confirmed. This prevents a DNS swap to a region that hasn't finished warming up.
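The gate described above is a bounded polling loop. Here is a minimal sketch using the numbers from this section (15-second interval, 4-minute budget, 2 healthy targets). The function name and the injected `count_healthy` and `sleep` callables are hypothetical; the actual lambda_function.py may be structured differently.

```python
import time
from typing import Callable


def wait_for_healthy_targets(
    count_healthy: Callable[[], int],
    required: int = 2,
    interval_s: float = 15.0,
    timeout_s: float = 240.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `required` targets are healthy or the time budget is spent."""
    waited = 0.0
    while True:
        if count_healthy() >= required:
            return True
        if waited + interval_s > timeout_s:
            # Another sleep would exceed the budget, so give up here.
            return False
        sleep(interval_s)
        waited += interval_s


# Simulated run: targets become healthy on the third poll, no real sleeping.
readings = iter([0, 1, 2])
ready = wait_for_healthy_targets(lambda: next(readings), sleep=lambda s: None)
```

In a real handler, `count_healthy` could wrap the ELBv2 `describe_target_health` call and count the targets whose `TargetHealth` state is `healthy`. The Lambda timeout also has to be longer than the polling budget, or the function is killed before it can switch DNS.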

Cross-Region SNS Invocation

The CloudWatch alarm in us-west-2 publishes to an SNS topic in us-west-2. That topic has a Lambda subscription pointing at the function in ca-central-1. The SNS topic policy explicitly allows CloudWatch to publish from the primary account, preventing unauthorized invocations.
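When SNS invokes the Lambda, the CloudWatch alarm arrives as a JSON string nested inside the SNS record. A handler would typically decode it and act only on a transition to ALARM. The sketch below shows that decoding; the function name and the alarm name `dr-primary-unhealthy-hosts` are hypothetical, and the sample event is trimmed to the fields used here.

```python
import json
from typing import Any, Dict, List


def alarms_in_alarm_state(event: Dict[str, Any]) -> List[str]:
    """Return the names of alarms in the SNS event whose new state is ALARM."""
    names = []
    for record in event.get("Records", []):
        # SNS delivers the CloudWatch alarm payload as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            names.append(message.get("AlarmName", ""))
    return names


# Trimmed sample event with a hypothetical alarm name.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "dr-primary-unhealthy-hosts",
                        "NewStateValue": "ALARM",
                        "OldStateValue": "OK",
                    }
                )
            }
        }
    ]
}
triggered = alarms_in_alarm_state(sample_event)
```

Filtering on `NewStateValue` keeps the function from running a failover if the same topic also receives OK notifications.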

Prerequisites

Before you run this, make sure you have:

AWS CLI configured with credentials that have admin-level permissions
Terraform >= 1.0 installed
A public Route 53 hosted zone already created in your account (e.g., neatfleets-services.com)
EC2 key pairs already created in both us-west-2 (primary) and eu-west-2 (DR) if you want SSH access
The lambda_function.py file present in the project root (it is included — Terraform packages it at plan time)

⚙️ Setup & Deployment

Step 1 — Clone the Project

git clone https://github.com/Joebaho/AWS-Multi-Region-DR-TF.git

cd AWS-Multi-Region-DR-TF

Step 2 — Configure Your Variables

Edit terraform.tfvars with your real values:

hosted_zone_name = "your-domain.com"

record_name = "dr.your-domain.com"

instance_type = "t3.micro"

primary_key_name = "your-us-west-2-keypair"

dr_key_name = "your-eu-west-2-keypair"

notification_email = "you@example.com"

Important: the zone named by hosted_zone_name must already exist as a public hosted zone in your AWS account. This project does not create the hosted zone; it only adds a record to it.

Step 3 — Initialize Terraform

terraform init

This downloads the AWS and Archive providers and pulls the terraform-aws-modules/vpc module for both regions. Expect it to take 30–60 seconds.

Step 4 — Validate the Configuration

terraform fmt
terraform validate

You should see: Success! The configuration is valid.

Step 5 — Preview the Plan

terraform plan -out=tfplan

Review the output. You will see roughly 70–80 resources being created across the three regions. Key ones to look for:

module.vpc_primary and module.vpc_dr — two full VPCs
aws_lb.primary and aws_lb.dr — two Application Load Balancers
aws_autoscaling_group.primary (desired: 2) and aws_autoscaling_group.dr (desired: 0)
aws_lambda_function.failover — the automation brain
aws_route53_record.app — the DNS record

Step 6 — Apply

terraform apply "tfplan"

This takes approximately 5–10 minutes due to NAT Gateway provisioning and ALB setup. Grab a coffee.

Step 7 — Confirm the Email Subscription

Check your inbox for an "AWS Notification - Subscription Confirmation" email from SNS and click Confirm Subscription. Without this, you won't receive failover notifications.

After you confirm your subscription, SNS shows a confirmation page in your browser.
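If you would rather check the confirmation from a script than from your inbox, SNS reports the literal string `PendingConfirmation` in place of a subscription ARN until the link has been clicked. The helper below is a sketch that filters a subscription list in the shape returned by boto3's `list_subscriptions_by_topic`. The function name, the sample ARN, and the email addresses are made up for illustration.

```python
from typing import Any, Dict, List


def unconfirmed_endpoints(subscriptions: List[Dict[str, Any]]) -> List[str]:
    """Return endpoints whose subscription is still waiting for confirmation."""
    return [
        sub["Endpoint"]
        for sub in subscriptions
        # SNS uses this placeholder instead of an ARN until confirmed.
        if sub.get("SubscriptionArn") == "PendingConfirmation"
    ]


# Sample data in the shape of the "Subscriptions" list (values are made up).
sample = [
    {
        "Protocol": "email",
        "Endpoint": "you@example.com",
        "SubscriptionArn": "PendingConfirmation",
    },
    {
        "Protocol": "email",
        "Endpoint": "ops@example.com",
        "SubscriptionArn": "arn:aws:sns:ca-central-1:123456789012:example-topic:abcd",
    },
]
pending = unconfirmed_endpoints(sample)
```

An empty result means every subscriber on the topic has confirmed.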

Step 8 — Verify the Deployment

terraform output

You'll see something like:

dr_alb_dns = "dr-secondary-alb-xxxx.eu-west-2.elb.amazonaws.com"
failover_lambda_name = "dr-failover-handler"
notification_topic_arn = "arn:aws:sns:ca-central-1:..."

Open your browser and visit http://dr.your-domain.com — you should see "Hello from Primary Region".

Testing the Failover

To simulate a primary region failure, set the primary ASG desired capacity to zero:

aws autoscaling update-auto-scaling-group \
  --region us-west-2 \
  --auto-scaling-group-name dr-primary-asg \
  --min-size 0 \
  --desired-capacity 0

Wait about 2 minutes for CloudWatch to fire (2 evaluation periods × 60 seconds). Then watch:

1. The CloudWatch alarm transitions to ALARM.

2. SNS triggers Lambda.

3. Lambda scales the DR ASG up to 2.

4. Route 53 flips to the DR ALB.

5. You receive a notification email.

Refresh http://dr.your-domain.com — you should now see "Hello from DR Region".
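The waiting times quoted in this article can be combined into a rough timing budget for the test. The arithmetic below is a back-of-the-envelope sketch with hypothetical function names. It covers only the two waits the article states (alarm evaluation and the DR health-check gate) and ignores metric delivery delay, Lambda start-up, and DNS caching on the client side.

```python
def detection_delay_s(evaluation_periods: int = 2, period_s: int = 60) -> int:
    """Seconds of sustained breach before the alarm can enter ALARM."""
    return evaluation_periods * period_s


def failover_upper_bound_s(detection_s: int, health_wait_s: int = 240) -> int:
    """Detection delay plus the longest the Lambda waits for DR targets."""
    return detection_s + health_wait_s


detection = detection_delay_s()                   # 2 x 60 = 120
upper_bound = failover_upper_bound_s(detection)   # 120 + 240 = 360
print(f"alarm after ~{detection}s, DNS switch within ~{upper_bound}s")
```

The 360-second figure is an upper bound on those two waits, not a typical result: the DR instances usually pass health checks well before the 4-minute limit, since instance boot time is cited earlier as under 90 seconds.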

Cleanup

To avoid ongoing charges, destroy everything when done:

terraform destroy -auto-approve

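Cost note: The most expensive resources while running are the NAT Gateways (one per region, about $0.045/hr each) and the ALBs (about $0.0225/hr each, plus roughly $0.008 per LCU-hour). Exact rates vary by region, so check the current AWS pricing pages. The EC2 instances are t3.micro and may be free-tier eligible depending on your account. Estimated cost for a full test: under $2 for a few hours.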

What This Project Does NOT Cover

This is a demo for infrastructure DR, not a full production DR solution. A real production setup would also need:

Database replication (RDS cross-Region read replicas or Aurora Global Database)
S3 Cross-Region Replication for object storage
ACM certificates issued in each region (ACM certificates are regional) and HTTPS listeners
Failback automation (returning traffic to primary after recovery)
Stateful session handling (sticky sessions or distributed session store)

Conclusion

This project demonstrates a clean separation of concerns across regions — primary workload, DR workload, and automation all running independently. The failover is fully hands-off, observable via CloudWatch and email, and the code is minimal enough to understand end-to-end in an afternoon. If you found this useful, feel free to fork it, adapt it for your stack, and share your improvements.

🤝 Contributing

Your perspective is valuable! Whether you see potential for improvement or appreciate what's already here, your contributions are welcomed and appreciated. Thank you for considering joining us in making this project even better. Feel free to follow me for updates on this project and others, and to explore opportunities for collaboration. Together, we can create something amazing!

📄 License

This project is licensed under the JoebahoCloud License
