This is part two of a multi-part blog series. In the previous post, we covered Disaster Recovery planning when building serverless applications. In this post, we'll discuss the systems engineering needed for an automated solution in the AWS cloud.
As I started looking into implementing Stackery's automated backup solution, my goal was simple: in order to support a disaster recovery plan, we needed a system that automatically creates backups of our database in a different account and a different region. This seemed like a straightforward task, but I was surprised to find that there was no documentation on how to do this in an automated, scalable way. All the documentation I could find discussed only partial solutions, and all of it relied on manual steps in the AWS Console. Yuck.
I hope that this post will help fill that void and help you understand how to implement an automated solution for your own disaster recovery plan. This post does get a bit long, so if that's not your thing, see the tl;dr.
AWS RDS has automated backups, which seemed like the perfect platform to base this automation upon. Furthermore, RDS even emits events that seem ideal for kicking off a Lambda function that then copies the snapshot to the disaster recovery account.
The first issue I discovered was that AWS does not allow you to share automated snapshots - AWS requires that you first make a manual copy of the snapshot before you can share it with another account. I initially thought that this wouldn't be a major issue - I could easily have my Lambda function kick off a manual copy first. According to the RDS Events documentation, there is an event, RDS-EVENT-0042, that fires when a manual snapshot is created. I could then use that event to share the newly created manual snapshot with the disaster recovery account.
This leads to the second issue - while RDS emits events for snapshots that are created manually, it does not emit events for snapshots that are copied manually. The AWS docs aren't clear about this, and it's an unfortunate feature gap. It meant that I had to fall back to a timer-based Lambda function that searches for and shares the latest available snapshot.
While this ended up more complicated than initially envisioned, Stackery still makes it easy to add all the needed pieces for fully automated backups. My implementation ended up looking like this:
The DB Event Subscription resource is a CloudFormation Resource that contains a small snippet of CloudFormation subscribing the DB Events topic to the RDS database.
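For illustration, such a subscription could look roughly like the following, written as a JavaScript object that serializes to CloudFormation JSON. This is a sketch under assumptions, not the snippet used in the Stackery stack: the helper name `buildEventSubscription`, the logical ID `DBEventsTopic`, and the choice to subscribe to the `creation` category of `db-snapshot` events are all hypothetical.

```javascript
// Hypothetical helper: builds an AWS::RDS::EventSubscription resource.
// `topicLogicalId` is an assumed logical ID of an SNS topic in the same template;
// Ref on an SNS topic resolves to the topic's ARN.
function buildEventSubscription (topicLogicalId) {
  return {
    Type: 'AWS::RDS::EventSubscription',
    Properties: {
      Enabled: true,
      SnsTopicArn: { Ref: topicLogicalId },
      // Assumption: subscribe to snapshot creation events only
      SourceType: 'db-snapshot',
      EventCategories: ['creation']
    }
  };
}

const resource = buildEventSubscription('DBEventsTopic');
console.log(JSON.stringify({ DBEventSubscription: resource }, null, 2));
```

The printed JSON could be pasted into the Resources section of a template, adjusted to whatever logical ID your topic actually has.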
This function receives the events from the RDS database via the DB Events topic. It then creates a copy of the snapshot with an ID that identifies the snapshot as an automated disaster recovery snapshot.
const AWS = require('aws-sdk');
const rds = new AWS.RDS();

const DR_KEY = 'dr-snapshot';
const ENV = process.env.ENV;

module.exports = async message => {
  // Only run DB Backups on Production and Staging
  if (!['production', 'staging'].includes(ENV)) {
    return {};
  }

  let records = message.Records;
  for (let i = 0; i < records.length; i++) {
    let record = records[i];

    if (record.EventSource === 'aws:sns') {
      let msg = JSON.parse(record.Sns.Message);

      if (msg['Event Source'] === 'db-snapshot' && msg['Event Message'] === 'Automated snapshot created') {
        let snapshotId = msg['Source ID'];
        let targetSnapshotId = `${snapshotId}-${DR_KEY}`.replace('rds:', '');

        let params = {
          SourceDBSnapshotIdentifier: snapshotId,
          TargetDBSnapshotIdentifier: targetSnapshotId
        };

        try {
          await rds.copyDBSnapshot(params).promise();
        } catch (error) {
          if (error.code === 'DBSnapshotAlreadyExists') {
            console.log(`Manual copy ${targetSnapshotId} already exists`);
          } else {
            throw error;
          }
        }
      }
    }
  }

  return {};
};
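To make the event filtering and naming concrete, here is a small standalone sketch of just that step, with no AWS calls. The helper name `targetSnapshotIdFor` and the sample payload are made up for illustration; the payload contains only the three fields the function above reads, and a real RDS notification may carry more.

```javascript
const DR_KEY = 'dr-snapshot';

// Returns the ID for the manual copy, or null when the SNS message is not
// an "Automated snapshot created" event for a DB snapshot.
function targetSnapshotIdFor (snsMessage) {
  const msg = JSON.parse(snsMessage);
  const isAutomatedSnapshot =
    msg['Event Source'] === 'db-snapshot' &&
    msg['Event Message'] === 'Automated snapshot created';

  if (!isAutomatedSnapshot) {
    return null;
  }

  // Append the DR marker and drop the `rds:` prefix of automated snapshots
  return `${msg['Source ID']}-${DR_KEY}`.replace('rds:', '');
}

// Made-up sample payload for illustration
const sample = JSON.stringify({
  'Event Source': 'db-snapshot',
  'Event Message': 'Automated snapshot created',
  'Source ID': 'rds:mydb-2018-06-01-08-05'
});

console.log(targetSnapshotIdFor(sample)); // mydb-2018-06-01-08-05-dr-snapshot
```

Keeping this logic in a pure function like this is one way to unit test the naming scheme without touching RDS.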
A couple of things to note:
- process.env.ENV is set based on the environment the stack is deployed to
- dr-snapshot is appended to the ID of the snapshot that is created

This function runs every few minutes and shares any disaster recovery snapshots to the disaster recovery account.
const AWS = require('aws-sdk');
const rds = new AWS.RDS();

const DR_KEY = 'dr-snapshot';
const DR_ACCOUNT_ID = process.env.DR_ACCOUNT_ID;
const ENV = process.env.ENV;

module.exports = async message => {
  // Only run on Production and Staging
  if (!['production', 'staging'].includes(ENV)) {
    return {};
  }

  // Get latest snapshot
  let snapshot = await getLatestManualSnapshot();
  if (!snapshot) {
    return {};
  }

  // See if snapshot is already shared with the Disaster Recovery Account
  let data = await rds.describeDBSnapshotAttributes({ DBSnapshotIdentifier: snapshot.DBSnapshotIdentifier }).promise();
  let attributes = data.DBSnapshotAttributesResult.DBSnapshotAttributes;

  let isShared = attributes.find(attribute => {
    return attribute.AttributeName === 'restore' && attribute.AttributeValues.includes(DR_ACCOUNT_ID);
  });

  if (!isShared) {
    // Share Snapshot with Disaster Recovery Account
    let params = {
      DBSnapshotIdentifier: snapshot.DBSnapshotIdentifier,
      AttributeName: 'restore',
      ValuesToAdd: [DR_ACCOUNT_ID]
    };
    await rds.modifyDBSnapshotAttribute(params).promise();
  }

  return {};
};

// Page through all snapshots and return the newest available manual DR snapshot
async function getLatestManualSnapshot (latest = undefined, marker = undefined) {
  let result = await rds.describeDBSnapshots({ Marker: marker }).promise();

  result.DBSnapshots.forEach(snapshot => {
    if (snapshot.SnapshotType === 'manual' && snapshot.Status === 'available' && snapshot.DBSnapshotIdentifier.includes(DR_KEY)) {
      if (!latest || new Date(snapshot.SnapshotCreateTime) > new Date(latest.SnapshotCreateTime)) {
        latest = snapshot;
      }
    }
  });

  if (result.Marker) {
    return getLatestManualSnapshot(latest, result.Marker);
  }

  return latest;
}
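The two decisions this function makes (which snapshot is the latest, and whether it is already shared) can be sketched as pure functions over plain data. The helper names `pickLatestDrSnapshot` and `isSharedWith`, the account ID, and the sample records are hypothetical; only the field names come from the code above.

```javascript
const DR_KEY = 'dr-snapshot';

// Pick the newest available manual snapshot whose ID carries the DR marker
function pickLatestDrSnapshot (snapshots) {
  return snapshots
    .filter(s => s.SnapshotType === 'manual' &&
      s.Status === 'available' &&
      s.DBSnapshotIdentifier.includes(DR_KEY))
    .reduce((latest, s) => {
      if (!latest || new Date(s.SnapshotCreateTime) > new Date(latest.SnapshotCreateTime)) {
        return s;
      }
      return latest;
    }, undefined);
}

// True when the `restore` attribute already lists the given account
function isSharedWith (attributes, accountId) {
  return attributes.some(a =>
    a.AttributeName === 'restore' && a.AttributeValues.includes(accountId));
}

// Made-up sample data for illustration
const snapshots = [
  { DBSnapshotIdentifier: 'mydb-01-dr-snapshot', SnapshotType: 'manual', Status: 'available', SnapshotCreateTime: '2018-06-01T08:05:00Z' },
  { DBSnapshotIdentifier: 'mydb-02-dr-snapshot', SnapshotType: 'manual', Status: 'available', SnapshotCreateTime: '2018-06-02T08:05:00Z' },
  { DBSnapshotIdentifier: 'mydb-03-dr-snapshot', SnapshotType: 'manual', Status: 'creating', SnapshotCreateTime: '2018-06-03T08:05:00Z' }
];
const attributes = [{ AttributeName: 'restore', AttributeValues: ['111111111111'] }];

const latest = pickLatestDrSnapshot(snapshots);
console.log(latest.DBSnapshotIdentifier);
console.log(isSharedWith(attributes, '111111111111'));
console.log(isSharedWith(attributes, '222222222222'));
```

Note that the snapshot still being created is skipped even though it is the newest, which mirrors the `Status === 'available'` check in the function above.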
- This function uses the ENV and DR_ACCOUNT_ID environment variables.
- When sharing a snapshot, AttributeName should be set to restore (see the AWS RDS SDK)

This function runs in the Disaster Recovery account and is responsible for detecting snapshots that are shared with it and making a local copy in the correct region - in this example, it makes a copy in us-east-1.
const AWS = require('aws-sdk');
const sourceRDS = new AWS.RDS({ region: 'us-west-2' });
const targetRDS = new AWS.RDS({ region: 'us-east-1' });

const DR_KEY = 'dr-snapshot';
const ENV = process.env.ENV;

module.exports = async message => {
  // Only Production_DR and Staging_DR are Disaster Recovery Targets
  if (!['production_dr', 'staging_dr'].includes(ENV)) {
    return {};
  }

  let [shared, local] = await Promise.all([getSourceSnapshots(), getTargetSnapshots()]);

  for (let i = 0; i < shared.length; i++) {
    let snapshot = shared[i];
    let fullSnapshotId = snapshot.DBSnapshotIdentifier;
    let snapshotId = getCleanSnapshotId(fullSnapshotId);

    if (!snapshotExists(local, snapshotId)) {
      let targetId = snapshotId;

      let params = {
        SourceDBSnapshotIdentifier: fullSnapshotId,
        TargetDBSnapshotIdentifier: targetId
      };

      // Copy with the target-region client so the copy lands in the same
      // region that getTargetSnapshots() checks for existing copies
      await targetRDS.copyDBSnapshot(params).promise();
    }
  }

  return {};
};

// Get snapshots that are shared to this account
async function getSourceSnapshots () {
  return getSnapshots(sourceRDS, 'shared');
}

// Get snapshots that have already been created in this account
async function getTargetSnapshots () {
  return getSnapshots(targetRDS, 'manual');
}

// Page through all snapshots visible to `rds` and collect the DR snapshots of the given type
async function getSnapshots (rds, typeFilter, snapshots = [], marker = undefined) {
  let params = {
    IncludeShared: true,
    Marker: marker
  };

  let result = await rds.describeDBSnapshots(params).promise();

  result.DBSnapshots.forEach(snapshot => {
    if (snapshot.SnapshotType === typeFilter && snapshot.DBSnapshotIdentifier.includes(DR_KEY)) {
      snapshots.push(snapshot);
    }
  });

  if (result.Marker) {
    return getSnapshots(rds, typeFilter, snapshots, result.Marker);
  }

  return snapshots;
}

// Check to see if the snapshot `snapshotId` is in the list of `snapshots`
function snapshotExists (snapshots, snapshotId) {
  for (let i = 0; i < snapshots.length; i++) {
    let snapshot = snapshots[i];
    if (getCleanSnapshotId(snapshot.DBSnapshotIdentifier) === snapshotId) {
      return true;
    }
  }
  return false;
}

// Strip everything up to the last colon, e.g. the `rds:` prefix that is
// prepended to the IDs of automatic backups
function getCleanSnapshotId (snapshotId) {
  let result = snapshotId.match(/:([a-zA-Z0-9-]+)$/);
  if (!result) {
    return snapshotId;
  } else {
    return result[1];
  }
}
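As a quick check of how the ID cleanup and the "already copied?" comparison behave, here is a standalone sketch with no AWS calls. It assumes shared snapshots are listed with a colon-separated, ARN-style identifier, while local copies are listed by plain name; the `snapshotsToCopy` helper, the account number, and the sample IDs are hypothetical.

```javascript
// Same regex as above: keep only what follows the last colon
function getCleanSnapshotId (snapshotId) {
  const result = snapshotId.match(/:([a-zA-Z0-9-]+)$/);
  return result ? result[1] : snapshotId;
}

// Hypothetical helper: shared snapshots that have no local copy yet
function snapshotsToCopy (shared, local) {
  const localIds = new Set(local.map(s => getCleanSnapshotId(s.DBSnapshotIdentifier)));
  return shared.filter(s => !localIds.has(getCleanSnapshotId(s.DBSnapshotIdentifier)));
}

// Made-up sample data for illustration
const shared = [
  { DBSnapshotIdentifier: 'arn:aws:rds:us-west-2:111111111111:snapshot:mydb-01-dr-snapshot' },
  { DBSnapshotIdentifier: 'arn:aws:rds:us-west-2:111111111111:snapshot:mydb-02-dr-snapshot' }
];
const local = [
  { DBSnapshotIdentifier: 'mydb-01-dr-snapshot' }
];

const pending = snapshotsToCopy(shared, local);
console.log(getCleanSnapshotId('rds:mydb-01'));
console.log(pending.map(s => getCleanSnapshotId(s.DBSnapshotIdentifier)));
```

Using a Set for the local IDs avoids rescanning the local list for every shared snapshot, which only matters once the number of snapshots grows.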
- By checking ENV, I ensure this function only runs in the Disaster Recovery accounts

Have a second function that monitors for the successful creation of the snapshot from the first function and shares it to your disaster recovery account.
Have a third function that operates in your disaster recovery account, monitors for snapshots shared to the account, and then creates a copy of the snapshot that is owned by the disaster recovery account and lives in the correct region.