AWS Domain 1.0: High Availability and Business Continuity

AWS Certified Solutions Architect Professional Notes

1.1 Demonstrate ability to architect the appropriate level of availability based on stakeholder requirements

AWS presents disaster recovery options as a spectrum, arranged into four groups:

A) Backup and Restore

Data is backed up off site regularly. Restoring the systems after a disaster can take a long time. Amazon S3 is ideal for backup data that might need to be restored quickly. Backups to Amazon S3 travel over the network, so they are accessible from any location.

AWS Import/Export can transfer large data sets by shipping storage devices to AWS.

For longer-term storage where retrieval times of several hours are adequate, there is Amazon Glacier. It has the same durability as Amazon S3 and is a low-cost alternative. S3 and Glacier can be combined to produce a tiered backup solution.
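
As a rough illustration of the tiered approach, the sketch below builds an S3 lifecycle configuration that keeps recent backups in S3, moves them to Glacier later, and finally expires them. The prefix, the day counts, and the bucket name in the comment are assumptions chosen for the example, not values from AWS guidance.

```python
import json


def tiered_backup_lifecycle(prefix, days_to_glacier, days_to_expire):
    """Build an S3 lifecycle configuration for a tiered backup.

    Objects under `prefix` stay in S3 for fast restores, move to Glacier
    after `days_to_glacier` days, and are deleted after `days_to_expire` days.
    """
    if days_to_expire <= days_to_glacier:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "tiered-backup",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": days_to_expire},
            }
        ]
    }


# Hypothetical values: 30 days in S3, then Glacier, deleted after one year.
config = tiered_backup_lifecycle("backups/", days_to_glacier=30, days_to_expire=365)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dictionary could be
# applied to a bucket (the bucket name here is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-backup-bucket", LifecycleConfiguration=config)
```

The expiration rule doubles as a simple retention policy, which is one of the key steps listed for backup and restore.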

AWS Storage Gateway enables snapshots of on-premises data volumes to be transparently copied to S3 for backup.

Gateway-cached volumes allow you to store your primary data on S3 but keep frequently accessed data local.

Gateway VTL (virtual tape library) is a backup target for existing backup management software. It is a replacement for traditional magnetic tape backup.

Storage Gateway now consists of:

https://d0.awsstatic.com/whitepapers/Storage/aws-storage-gateway-file-gateway-for-hybrid-architectures.pdf

1) Files

A simple solution for presenting one or more S3 buckets and their objects as a mountable NFS share to one or more clients.

Deployed as a virtual appliance, either on VMware or as an EC2 instance.

Use Cases:

Cloud Tiering – for on-premises environments where storage resources are reaching capacity

Hybrid Cloud Backup – can be used as a backup volume

2) Volumes

a) cached

b) stored

3) Virtual tapes

a) library (S3 storage)

b) shelf (retrieval takes about 24 hours)

Key Steps for backup and restore:

1) Select an appropriate tool or method to backup your data into AWS

2) Ensure that you have an appropriate retention policy for this data

3) Ensure that appropriate security measures are in place for this data, including encryption and access policies

4) Regularly test the recovery of this data and restoration of your system.

B) Pilot light for quick recovery into AWS

A minimal version of the environment is always running in the cloud. Configure and run the most critical core elements of the system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.

Typically, the system would have some pre-configured servers bundled as AMIs, which are ready to be started up at a moment's notice.

Key Steps for preparation phase:

1) Set up Amazon EC2 instances to replicate or mirror data

2) Ensure that you have all supporting custom software packages available in AWS.

3) Create and maintain AMIs of key servers where fast recovery is required.

4) Regularly run these servers, test them, and apply any software updates and configuration changes

5) Consider automating the provisioning of AWS resources

Key Steps for recovery phase:

1) Start your application EC2 instances from your custom AMIs.

2) Resize existing database/data store instances to process the increased traffic

3) Add additional database/data store instances to give the DR site resilience in the data tier. If you are using Amazon RDS, turn on Multi-AZ to improve resilience

4) Change DNS to point at the Amazon EC2 servers

5) Install and configure any non-AMI-based systems, ideally in an automated way.

C) Warm standby solution in AWS

A DR scenario in which a scaled-down version of a fully functional environment is always running in the cloud. Business-critical systems are fully duplicated and always on.

Key Steps for preparation phase:

1) Set up Amazon EC2 instances to replicate or mirror data

2) Create and maintain AMIs

3) Run your application using a minimal footprint of EC2 instances or AWS infrastructure.

4) Patch and update software and configuration files in line with your live environments.

Key Steps for recovery phase:

1) Increase the size of the Amazon EC2 fleets in service with the load balancer (horizontal scaling)

2) Start applications on larger Amazon EC2 instance types as needed (vertical scaling)

3) Either manually change the DNS records, or use Amazon Route 53 automated health checks so that all traffic is routed to the AWS environment.

4) Consider using Auto Scaling to right size the fleet or accommodate the increased load

5) Add resilience or scale up your database
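
To make the horizontal scaling step concrete, here is a small sizing sketch. It estimates how many instances the warm standby fleet must grow to in order to carry production load. The request rates, the per-instance capacity, and the 25% headroom are hypothetical figures; real sizing would come from load testing.

```python
import math


def instances_needed(peak_requests_per_sec, requests_per_instance, headroom=0.25):
    """Return the fleet size required to serve peak load with spare headroom."""
    if requests_per_instance <= 0:
        raise ValueError("requests_per_instance must be positive")
    required = peak_requests_per_sec * (1 + headroom) / requests_per_instance
    # Always keep at least one instance in service.
    return max(1, math.ceil(required))


standby_fleet = 2  # hypothetical always-on warm standby size
full_fleet = instances_needed(peak_requests_per_sec=1800, requests_per_instance=200)
print(f"scale out from {standby_fleet} to {full_fleet} instances")
```

The gap between the standby size and the computed size is what the recovery phase has to close, either manually or through Auto Scaling.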

D) Multi-Site Solution Deployed on AWS and On Site

An active-active configuration. The data replication method is determined by the RPO.

Key Steps for preparation phase:

1) Set up your AWS environment to duplicate your production environments

2) Set up DNS weighting or similar traffic routing technology to distribute incoming requests to both sites. Configure automated failover to reroute traffic away from the affected site.

Key Steps for recovery phase:

1) Either manually or by using DNS failover, change the DNS weighting so that all requests are sent to the AWS site

2) Have application logic for failover to use the local AWS database servers for all queries

3) Consider using Auto Scaling to automatically right size the AWS fleet
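
The effect of DNS weighting can be sketched with a little arithmetic. Under weighted routing, each record receives its weight divided by the sum of all weights, so setting the failed site's weight to zero sends everything to the other site. The site names and weights below are hypothetical, and the code only models the arithmetic; it does not call Route 53.

```python
def traffic_share(weights):
    """Return each site's share of traffic under weighted routing.

    A record receives weight / (sum of all weights) of the requests.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one site must have a non-zero weight")
    return {site: weight / total for site, weight in weights.items()}


def fail_over(weights, failed_site):
    """Return new weights with the failed site taken out of rotation."""
    return {site: (0 if site == failed_site else weight)
            for site, weight in weights.items()}


normal = {"on_site": 50, "aws": 50}  # hypothetical record weights
print(traffic_share(normal))
print(traffic_share(fail_over(normal, "on_site")))
```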

1.2 Demonstrate ability to implement DR for systems based on RPO and RTO

RTO – recovery time objective. The time it takes after a disruption to restore a business process to its service level.

RPO – recovery point objective. The acceptable amount of data loss measured in time.
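
A quick way to reason about the two objectives is to compare them with what a design can actually deliver. In the sketch below, the worst-case data loss of a periodic backup is taken to be the full backup interval, and the restore time is compared with the RTO. The nightly backup, the eight-hour restore, and the targets are hypothetical numbers.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """A failure just before the next backup loses a full interval of data."""
    return backup_interval


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Check a DR design against its recovery point and recovery time objectives."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": restore_time <= rto,
    }


# Hypothetical design: nightly backups and an eight-hour restore.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(hours=8),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=12),
)
print(result)
```

Here the RTO is met but the RPO is not, which suggests moving from nightly backups toward more frequent backups or continuous replication, as in the pilot light and warm standby options.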

1.3 Determine appropriate use of Multi-Availability Zones vs. Multi-Region architectures

Applications deployed on AWS have multi-site capability by means of multiple Availability Zones. Availability Zones are distinct locations that are engineered to be insulated from each other. They provide inexpensive, low-latency network connectivity within the same region. Depending on business or regulatory requirements, the components for DR can be deployed in multiple regions.
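
The benefit of spreading an application across locations can be estimated with a simple redundancy calculation. The sketch assumes each location fails independently and that the application stays up while any one location is up. Both are idealizations, and the 99% figure is a made-up input, not an AWS SLA.

```python
def combined_availability(single, copies):
    """Availability of `copies` redundant deployments, assuming independent failures."""
    return 1 - (1 - single) ** copies


def downtime_hours_per_year(availability):
    """Convert an availability fraction into expected hours of downtime per year."""
    return (1 - availability) * 365 * 24


for copies in (1, 2, 3):
    a = combined_availability(0.99, copies)
    print(f"{copies} location(s): {a:.6f} available, "
          f"{downtime_hours_per_year(a):.2f} h/year down")
```

The same arithmetic applies to zones and to regions. The difference lies in how independent the failures really are and in the latency and cost of replicating data between the locations.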

1.4 Demonstrate ability to implement self-healing capabilities

Content may include the following: High Availability vs. Fault Tolerance

Auto Scaling groups can manage the number of instances in an EC2 fleet. Conditions can be set to add new EC2 instances in increments to the Auto Scaling group depending on certain factors, such as instance utilization or unhealthy applications.
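
The decision logic behind such conditions can be sketched as a small function. This is a simplified model of a step-style policy, not the Auto Scaling API: the thresholds, step size, and fleet limits are hypothetical, and unhealthy instances are assumed to be replaced one for one.

```python
def desired_capacity(current, avg_cpu, unhealthy, *, minimum=2, maximum=10,
                     scale_out_at=70.0, scale_in_at=30.0, step=2):
    """Return the next fleet size for a simplified scaling policy.

    Unhealthy instances are replaced one for one, so they do not change the
    target; average CPU utilization then moves the target up or down by `step`.
    """
    target = current
    if avg_cpu >= scale_out_at:
        target += step
    elif avg_cpu <= scale_in_at:
        target -= step
    # Keep the fleet within its configured bounds.
    target = max(minimum, min(maximum, target))
    return {"desired": target, "replace": min(unhealthy, current)}


print(desired_capacity(4, avg_cpu=85.0, unhealthy=1))
print(desired_capacity(2, avg_cpu=10.0, unhealthy=0))
```

Replacing failed instances without operator involvement is the self-healing part; the utilization thresholds handle changes in load.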