# Automated Multi-Region Disaster Recovery on AWS with Terraform




## **📖 Overview**

Downtime is expensive. A single-region application is a single point of failure. This project eliminates that risk by deploying a Pilot Light disaster recovery architecture across two AWS regions — with zero manual intervention required during a failover event. When the primary region goes unhealthy, the system automatically:

1. Detects the failure via CloudWatch.
2. Triggers a Lambda function through SNS.
3. Scales up the dormant DR region.
4. Redirects DNS (Route 53) to the DR ALB.
5. Scales the primary region down to zero.
6. Sends you an email notification.

All of this typically happens in under 5 minutes.

## **Architecture**

The project spans three AWS regions, each with a dedicated purpose:

| Region | Role | What lives there |
| --- | --- | --- |
| us-west-2 | Primary | VPC, ALB, ASG (2 instances), CloudWatch alarm, SNS alarm topic |
| eu-west-2 | DR (Pilot Light) | VPC, ALB, ASG (0 instances at rest) |
| ca-central-1 | Automation | Lambda, Route 53, SNS notification topic |

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/43c1a3cb-dd0c-462d-86fe-6cd5da12dfc5.png align="center")

## **🗂️ Project Structure**

The Terraform code is cleanly split across purpose-driven files:

*   `provider.tf`: Three aliased AWS providers
*   `vpc.tf`: VPCs for primary + DR via terraform-aws-modules/vpc
*   `ec2.tf`: Launch Templates, Security Groups, Auto Scaling Groups
*   `alb.tf`: Application Load Balancers + Target Groups + Listeners
*   `cloudwatch.tf`: CloudWatch Alarm + SNS alarm topic + topic policy
*   `lambda.tf`: IAM Role/Policy, Lambda function, SNS subscription
*   `route53.tf`: Route 53 hosted zone lookup + A alias record
*   `data.tf`: AMI data sources + local key-pair resolution
*   `variables.tf`: All input variables with sensible defaults
*   `outputs.tf`: ALB DNS names, Lambda name, SNS ARNs
*   `terraform.tfvars`: Your actual configuration values

***Normal Traffic Flow***

When everything is healthy:

User Request

└── dr.neatfleets-services.com (Route 53 A alias)

└── Primary ALB (us-west-2)

└── EC2 instances (private subnets, Apache httpd)

The DR region sits idle — the ASG has desired\_capacity = 0, so you pay nothing for EC2 until you need it. The DR ALB and networking are always up so failover is fast.

***The Failover Flow***

Primary ALB HealthyHostCount < 1

└── CloudWatch Alarm (ALARM state)

└── SNS Topic (us-west-2) → Lambda (ca-central-1)

├── Scale DR ASG: 0 → 2 instances (eu-west-2)

├── Wait for DR ALB healthy targets (up to 4 min)

├── Update Route 53: dr.neatfleets-services.com → DR ALB

├── Scale Primary ASG: 2 → 0 instances

└── Send email notification via SNS
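The Route 53 step in the flow above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual `lambda_function.py`: the helper names `build_alias_upsert` and `point_record_at_alb` are hypothetical, and `route53_client` is assumed to be a boto3 Route 53 client created by the caller.

```python
def build_alias_upsert(record_name, alb_dns_name, alb_zone_id):
    """Build a Route 53 ChangeBatch that points an A alias record at an ALB."""
    return {
        "Comment": "Failover: point record at the DR ALB",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        # Hosted zone ID of the load balancer, not of your domain
                        "HostedZoneId": alb_zone_id,
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ],
    }


def point_record_at_alb(route53_client, hosted_zone_id, record_name,
                        alb_dns_name, alb_zone_id):
    """Apply the alias change to the hosted zone that owns record_name."""
    batch = build_alias_upsert(record_name, alb_dns_name, alb_zone_id)
    return route53_client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=batch
    )
```

Keeping the change batch in a pure function makes it easy to unit-test without touching AWS. Note that an alias target needs two different zone IDs: the one for your domain's hosted zone and the one belonging to the load balancer.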

## **Key Design Decisions**

**Pilot Light, Not Warm Standby**

The DR ASG starts at zero desired capacity. This keeps costs low during normal operation while still enabling fast recovery: EC2 boot time on Amazon Linux 2 with httpd is typically under 90 seconds.

**Three-Region Separation**

The automation layer (Lambda, Route 53, notification SNS) runs in a third region (ca-central-1), isolated from both the primary failure and the DR workload. This means the failover brain is not affected by the outage it is responding to.

**Health-Check-Gated DNS Swap**

The Lambda does not blindly flip DNS. It polls the DR target group every 15 seconds (up to 4 minutes) and only updates Route 53 once at least 2 healthy targets are confirmed. This prevents a DNS swap to a region that hasn't finished warming up.
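A polling gate of this kind might look like the sketch below. It is an illustration only, assuming a boto3 `elbv2` client is passed in; the function name `wait_for_healthy_targets` and its parameters are hypothetical and are not taken from the project's Lambda code. The defaults mirror the numbers in the paragraph above (15-second interval, 4-minute limit, 2 healthy targets).

```python
import time


def wait_for_healthy_targets(elbv2_client, target_group_arn, min_healthy=2,
                             interval_s=15, timeout_s=240, sleep=time.sleep):
    """Poll a target group until enough targets are healthy.

    Returns True once min_healthy targets report 'healthy', or False if
    timeout_s elapses first. `sleep` is injectable so tests can run instantly.
    """
    waited = 0
    while True:
        resp = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
        healthy = sum(
            1
            for desc in resp["TargetHealthDescriptions"]
            if desc["TargetHealth"]["State"] == "healthy"
        )
        if healthy >= min_healthy:
            return True
        if waited >= timeout_s:
            return False  # caller should skip the DNS swap
        sleep(interval_s)
        waited += interval_s
```

The caller would only proceed to the Route 53 update when this returns `True`. A gate like this also implies that the Lambda's configured timeout has to be longer than the polling window, otherwise the function is killed before it can decide.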

**Cross-Region SNS Invocation**

The CloudWatch alarm in us-west-2 publishes to an SNS topic in us-west-2. That topic has a Lambda subscription pointing at the function in ca-central-1. The SNS topic policy explicitly allows CloudWatch to publish from the primary account, preventing unauthorized invocations.
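On the receiving side, the Lambda gets the alarm wrapped in an SNS envelope. The sketch below shows one way a handler could decide whether an incoming event is really the primary alarm entering the ALARM state. It is a hedged example: `alarm_requires_failover` and the alarm name used in the usage line are hypothetical, while the `Records[].Sns.Message`, `AlarmName`, and `NewStateValue` fields follow the standard SNS-to-Lambda and CloudWatch alarm notification formats.

```python
import json


def alarm_requires_failover(event, expected_alarm_name):
    """Return True if any SNS record carries the expected alarm in ALARM state."""
    for record in event.get("Records", []):
        try:
            # CloudWatch delivers the alarm details as a JSON string
            message = json.loads(record["Sns"]["Message"])
        except (KeyError, TypeError, ValueError):
            continue  # not an SNS record, or the message is not JSON
        if not isinstance(message, dict):
            continue
        if (message.get("AlarmName") == expected_alarm_name
                and message.get("NewStateValue") == "ALARM"):
            return True
    return False
```

A handler could start with `if not alarm_requires_failover(event, "my-primary-alarm"): return`, which ignores OK transitions and unrelated messages published to the same topic.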

## Prerequisites

Before you run this, make sure you have:

*   AWS CLI configured with credentials that have admin-level permissions
    
*   Terraform >= 1.0 installed
    
*   A public Route 53 hosted zone already created in your account (e.g., neatfleets-services.com)
    
*   EC2 key pairs already created in both us-west-2 (primary) and eu-west-2 (DR) if you want SSH access
    
*   The lambda\_function.py file present in the project root (it is included — Terraform packages it at plan time)
    

## **⚙️ Setup & Deployment**

**Step 1 — Clone the Project**

```bash
git clone https://github.com/Joebaho/AWS-Multi-Region-DR-TF.git

cd AWS-Multi-Region-DR-TF 
```

**Step 2 — Configure Your Variables**

Edit terraform.tfvars with your real values:

```hcl
# Must match an existing public hosted zone in your account
hosted_zone_name   = "your-domain.com"
record_name        = "dr.your-domain.com"
instance_type      = "t3.micro"
primary_key_name   = "your-us-west-2-keypair"
dr_key_name        = "your-eu-west-2-keypair"
notification_email = "your email address"
```

> **Important**: The zone named in hosted\_zone\_name must already exist as a public hosted zone in your AWS account. This project does not create the hosted zone; it only adds a record to it.

**Step 3 — Initialize Terraform**

```bash
terraform init
```

This downloads the AWS and Archive providers and pulls the terraform-aws-modules/vpc module for both regions. Expect it to take 30–60 seconds.

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/abd0813e-6bc0-4870-a3b4-7dad043eae1a.png align="center")

**Step 4 — Validate the Configuration**

```bash
terraform fmt
terraform validate
```

You should see: Success! The configuration is valid.

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/24d94de2-2970-45ee-aa24-915ed521b691.png align="center")

**Step 5 — Preview the Plan**

```bash
terraform plan -out=tfplan
```

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/dcf34dd1-cc8f-430b-aa5f-57bd885bcbde.png align="center")

Review the output. You will see roughly 70–80 resources being created across the three regions. Key ones to look for:

*   module.vpc\_primary and module.vpc\_dr — two full VPCs
    
*   aws\_lb.primary and aws\_lb.dr — two Application Load Balancers
    
*   aws\_autoscaling\_group.primary (desired: 2) and aws\_autoscaling\_group.dr (desired: 0)
    
*   aws\_lambda\_function.failover — the automation brain
    
*   aws\_route53\_record.app — the DNS record
    

**Step 6 — Apply**

```bash
terraform apply "tfplan"
```

This takes approximately 5–10 minutes due to NAT Gateway provisioning and ALB setup. Grab a coffee.

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/966ed6ef-8b7c-4919-9515-c977b1038c7e.png align="center")

**Step 7 — Confirm the Email Subscription**

Check your inbox for an "AWS Notification - Subscription Confirmation" email from SNS and click Confirm Subscription. Without this, you won't receive failover notifications.

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/01302dc9-51ac-4acd-ba82-d79820dcc000.png align="center")

After you confirm your subscription, you will land on this page:

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/9c34da21-f31e-4cba-9e2d-a0b4cdfd1bb2.png align="center")

**Step 8 — Verify the Deployment**

```bash
terraform output
```

You'll see something like:

```plaintext
dr_alb_dns = "dr-secondary-alb-xxxx.eu-west-2.elb.amazonaws.com"
failover_lambda_name = "dr-failover-handler"
notification_topic_arn = "arn:aws:sns:ca-central-1:..."
```

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/77b584a4-854f-4b87-91ec-4952155017c4.png align="center")

Open your browser and visit http://dr.your-domain.com — you should see "Hello from Primary Region".

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/1f1fa639-7083-4b8c-a5f9-7da7c1950ef0.png align="center")

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/7bd695bf-7237-452a-b2c9-bff3f0546d3d.png align="center")

## Testing the Failover

To simulate a primary region failure, set the primary ASG desired capacity to zero:

```bash
# Backslashes continue the command across lines
aws autoscaling update-auto-scaling-group \
  --region us-west-2 \
  --auto-scaling-group-name dr-primary-asg \
  --min-size 0 \
  --desired-capacity 0
```

Wait about 2 minutes for CloudWatch to fire (2 evaluation periods × 60 seconds). Then watch:

1. The CloudWatch alarm transitions to ALARM.
2. SNS triggers Lambda.
3. Lambda scales the DR ASG up to 2.
4. Route 53 flips to the DR ALB.
5. You receive a notification email.

Refresh http://dr.your-domain.com — you should now see "Hello from DR Region".

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/0329febd-d377-4b33-a672-79350c1d2c5d.png align="center")

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/ada431f4-2be1-44c2-98b2-c6e8b86dc94c.png align="center")

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/352ecbcf-cc7c-4e53-bdda-dd53552e955d.png align="center")

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/cc0de9a9-cbca-4bbf-8560-8c8078e5f255.png align="center")
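Rather than refreshing the browser by hand, you could watch for the cutover with a small script. The sketch below is one possible approach, not part of the project: `wait_for_marker` and `fetch_body` are hypothetical helpers, and the URL and page text in the usage note are the placeholders used in this walkthrough.

```python
import time
import urllib.request


def fetch_body(url, timeout_s=5):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8", errors="replace")


def wait_for_marker(url, marker, fetch=fetch_body, interval_s=10,
                    max_attempts=60, sleep=time.sleep):
    """Poll url until the body contains marker.

    Returns the 1-based attempt number on success, or None if the marker
    never appears within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if marker in fetch(url):
                return attempt
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses;
            # brief errors are expected while traffic is moving between regions
            pass
        if attempt < max_attempts:
            sleep(interval_s)
    return None
```

For example, `wait_for_marker("http://dr.your-domain.com", "Hello from DR Region")` returns once the DR page is being served. Keep in mind that your local resolver may cache the old DNS answer for a while, so the observed switch can lag the actual Route 53 change.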

## Cleanup

To avoid ongoing charges, destroy everything when done:

```bash
terraform destroy -auto-approve
```

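> Cost note: The most expensive resources while running are the NAT Gateways (one per region, roughly $0.045/hr each plus per-GB data processing) and the ALBs (roughly $0.02 to $0.03/hr each, plus LCU charges of about $0.008 per LCU-hour). The EC2 instances are t3.micro, which may be free-tier eligible depending on your account and region. Estimated cost for a full test: under $2 for a few hours.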

![](https://cdn.hashnode.com/uploads/covers/6605b33d2b011c6238012384/191440cb-d818-4150-b650-4dce995022d8.png align="center")
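To sanity-check the cost note above for your own test window, a back-of-the-envelope calculation is enough. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the hourly rates in the example are assumptions (on-demand prices differ by region and change over time, and data processing and LCU charges are ignored), so substitute current figures from the AWS pricing pages.

```python
def estimate_cost(hours, counts, hourly_rates_usd):
    """Sum hourly charges for each resource type over the test window."""
    per_hour = sum(counts[name] * hourly_rates_usd[name] for name in counts)
    return per_hour * hours


# Assumed resource counts: one NAT Gateway and one ALB per workload region,
# and two running instances (either the primary pair or the DR pair).
counts = {"nat_gateway": 2, "alb": 2, "t3_micro": 2}

# Assumed rates in USD per hour, for illustration only.
rates = {"nat_gateway": 0.045, "alb": 0.0225, "t3_micro": 0.0104}

if __name__ == "__main__":
    print(f"Estimated cost for 3 hours: ${estimate_cost(3, counts, rates):.2f}")
```

Even with some margin for data transfer, a test of a few hours stays well under the $2 estimate, as long as you remember to run the destroy step afterwards.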

## What This Project Does NOT Cover

This is a demo for infrastructure DR, not a full production DR solution. A real production setup would also need:

*   Database replication (RDS cross-Region read replicas or Aurora Global Database)
    
*   S3 Cross-Region Replication for object storage
    
*   ACM certificates issued in each region (ACM certificates are regional) and HTTPS listeners
    
*   Failback automation (returning traffic to primary after recovery)
    
*   Stateful session handling (sticky sessions or distributed session store)
    

## **Conclusion**

This project demonstrates a clean separation of concerns across regions — primary workload, DR workload, and automation all running independently. The failover is fully hands-off, observable via CloudWatch and email, and the code is minimal enough to understand end-to-end in an afternoon. If you found this useful, feel free to fork it, adapt it for your stack, and share your improvements.

## **🤝 Contributing**

Your perspective is valuable! Whether you see potential for improvement or appreciate what's already here, your contributions are welcomed and appreciated. Thank you for considering joining us in making this project even better. Feel free to follow me for updates on this project and others, and to explore opportunities for collaboration. Together, we can create something amazing!

## **📄 License**

This project is licensed under the JoebahoCloud License.
