One of the hallmarks of serverless architectures is ephemeral compute resources. All three major cloud service providers place execution time limits on the order of minutes (five minutes for AWS and Azure, nine for Google). Even more challenging are the limits imposed by API gateways placed in front of functions. For example, AWS API Gateway waits only up to 30 seconds for a Lambda function to respond before it times out.
These limits enforce a micro-service approach to web application architecture. As a developer you must devise ways of handling requests in a short amount of time. But even after you have designed an architecture that fits within the timeout limits, how will you monitor and debug issues when timeouts inevitably occur? It's worth noting that while timeouts can be due to bugs we introduce into the system, not all timeouts are within our control. Sometimes an external service we depend on may take too long or return incorrect data that affects our functions' execution.
One way to track timeouts is to watch the function logs. AWS Lambda will log a message when a function times out:
START RequestId: 6f747a63-ce41-11e7-8cfe-3d61dad3d219 Version: $LATEST
END RequestId: 6f747a63-ce41-11e7-8cfe-3d61dad3d219
REPORT RequestId: 6f747a63-ce41-11e7-8cfe-3d61dad3d219 Duration: 3001.52 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 20 MB
2017-11-20T22:23:32.998Z 6f747a63-ce41-11e7-8cfe-3d61dad3d219 Task timed out after 3.00 seconds
While the logs aren't readily human-readable, it's possible to watch them with another Lambda function (a rough sketch of this approach appears below) or to pull them into an external logging solution to notice when functions time out. But no matter which approach you take, there can be a considerable delay (up to a minute or more) before the logs are analyzed. It also becomes hard to piece together the causes and symptoms of a timeout when all you have to work with are log lines. While it's certainly possible to build a system based solely on logs, there is a better way.
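For completeness, here is a minimal sketch of that log-watching approach, assuming a Lambda function subscribed to the monitored function's CloudWatch Logs log group (the subscription filter and any downstream alerting are left out and would be configured separately):

const zlib = require('zlib');

exports.handler = function (event, context, callback) {
  // CloudWatch Logs delivers subscription records as base64-encoded, gzipped JSON
  const payload = JSON.parse(
    zlib.gunzipSync(Buffer.from(event.awslogs.data, 'base64')).toString()
  );

  // Each record carries the log group name and a batch of log events
  payload.logEvents.forEach((logEvent) => {
    if (logEvent.message.indexOf('Task timed out') !== -1) {
      // Surface the timeout however you like: a metric, an alert, a queue, etc.
      console.log('Timeout detected in ' + payload.logGroup + ': ' + logEvent.message);
    }
  });

  callback(null);
};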
At Stackery we wanted a way to notice timeouts and handle them gracefully in a more real-time and flexible fashion. To enable this, we added a wrapper inside our functions that injects a watchdog timer. A watchdog timer is set to an interval within which an expected amount of work should complete. For example, if you expect a compute task to take 10 seconds, you might set a watchdog timer for 15 seconds. If the task isn't complete after 15 seconds, the timer triggers other logic that handles the exceptional case.
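As a rough conceptual sketch (not our actual wrapper), the idea looks something like this in Node.js, assuming a task that returns a promise:

function runWithWatchdog(task) {
  // Expect the task to finish within 10 seconds; arm the watchdog for 15
  const watchdog = setTimeout(() => {
    // Exceptional-case logic goes here: alert, clean up, retry, etc.
    console.error('Watchdog fired: task did not complete within 15 seconds');
  }, 15000);

  // 'task' is assumed to return a promise
  return task().then((result) => {
    clearTimeout(watchdog); // Task finished in time, so disarm the watchdog
    return result;
  });
}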
In other contexts, the code that handles the exceptional case can generally take as much time as it needs. For example, in many hardware use cases a watchdog timer triggers reset functionality that reboots a component of the system. The reboot, which shouldn't happen very often, can take its time to ensure everything is restored to working order. In contrast, if we want to use watchdog timers in serverless functions we need to be cognizant of the time spent handling the exceptional case, because that exception handling logic runs in the same function that has an overall time limit placed on it.
Our solution was to install a watchdog timer that fires a set amount of time before the function itself would time out. In that window, the function can notify another resource that it has timed out. For example, an AWS Lambda function responding to API Gateway requests may have an API Gateway timeout of 30 seconds and a Lambda timeout of 31 seconds. When the function starts, it sets a watchdog timer for 30 seconds. If that timer is triggered, we know the API Gateway response has already timed out, but we have one more second to send off a notification message with diagnostic info. Let's look at some example Node.js code:
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

exports.handler = function (event, context, callback) {
  // Install the watchdog timer as the first thing you do
  setupWatchdogTimer(event, context, callback);

  // Do stuff...
}

function setupWatchdogTimer(event, context, callback) {
  // When the timer triggers, emit an event to the 'myErrorHandler' function
  const timeoutHandler = () => {
    // Include the original event and request ID for diagnostics
    const message = {
      message: 'Function timed out',
      originalEvent: event,
      originalRequestId: context.awsRequestId
    };

    // Invoke with the 'Event' invocation type so we don't wait for a response
    const params = {
      FunctionName: 'myErrorHandler',
      InvocationType: 'Event',
      Payload: JSON.stringify(message)
    };

    // Emit the event to 'myErrorHandler', then call back with an error from
    // this function
    lambda.invoke(params).promise()
      .then(() => callback(new Error('Function timed out')));
  }

  // Set the timer so it triggers one second before this function would time out
  setTimeout(timeoutHandler, context.getRemainingTimeInMillis() - 1000);
}
The above function emits an event to another Lambda function when it is about to time out. It not only tells the other function that it timed out, it also passes along details about the original event that caused the timeout. This data can be sent to error aggregation services like Rollbar or Sentry for debugging. It can also be used for custom retry or other error handling logic. For example, it could send an email to the affected user (based on data from the original event message) to tell them you noticed the issue and that customer service will follow up with a resolution soon.
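As an illustration, a minimal 'myErrorHandler' might look like the sketch below. The SNS topic and the ALERT_TOPIC_ARN environment variable are assumptions made for this example; in practice you might forward the payload to Rollbar, Sentry, or your own retry logic instead:

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

exports.handler = function (event, context, callback) {
  // 'event' is the message emitted by the watchdog timer above:
  // { message, originalEvent, originalRequestId }
  const params = {
    TopicArn: process.env.ALERT_TOPIC_ARN, // hypothetical alerting topic
    Subject: 'Lambda timeout: ' + event.originalRequestId,
    Message: JSON.stringify(event, null, 2)
  };

  // Publish the diagnostics so alerting or retry logic can act on them
  sns.publish(params).promise()
    .then(() => callback(null))
    .catch(callback);
};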
We implemented the above for our own needs within Stackery early on. We realized quickly that everyone using serverless functions would want the same functionality, so we built it into our product. Every function deployed by Stackery has a small wrapper that captures timeouts (and other uncaught errors) and emits events like the above. These events are sent to all the nodes connected to the output of the special Errors node. It's essentially a stack-wide exception handler.
As serverless architectures continue to gain acceptance in business-critical applications, it will be important to develop techniques that enable monitoring and operations. Tracking and handling timeouts is one key piece of the serverless ops puzzle.