Serverless is great. It takes many worries out of your hands and puts them into the hands of your cloud provider, like AWS.
Things like scalability, reliability, and server maintenance go right out the door when it comes to your responsibilities. But that doesn’t mean you don’t have responsibilities of your own.
The purpose of a load test is to see if your application can handle expected traffic, plus a little more. But isn't that the entire premise of serverless scalability? Shouldn't this be your cloud provider's problem, not yours?
You still have the responsibility of making sure you string services together correctly and have provided your endpoints proper throttling mechanisms.
A serverless load test is not meant to test if your cloud vendor can scale, but rather if you’ve designed software intended to scale.
With this in mind, there are several things you need to do before you hit the “Go” button on that load test.
When running a load test, you want to make sure the actions that will be performed 80% of the time will scale the best. In an ideal world we could load test everything, but in a pragmatic world we need to test the majority use cases.
If you have a shopping cart app, your primary use cases will be adding items to your cart, removing items from your cart, and checking out. There are many other things you can do in a shopping app, but 80% of the time, your users will be doing one of those three actions.
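One way to bake that 80/20 split into a load test is to weight the simulated actions. Here is a minimal sketch in Python; the action names and weights are made up for the shopping cart example, and the weighting uses the standard library's `random.choices`:

```python
import random

# Hypothetical traffic mix for a shopping cart app: the three primary
# actions get ~80% of simulated traffic, everything else shares the rest.
ACTIONS = {
    "add_to_cart": 40,
    "remove_from_cart": 20,
    "checkout": 20,
    "browse_catalog": 10,
    "update_profile": 10,
}

def pick_action(rng=random):
    """Pick the next simulated user action according to the traffic weights."""
    names = list(ACTIONS)
    weights = list(ACTIONS.values())
    return rng.choices(names, weights=weights, k=1)[0]

# Over many picks, roughly 80% land on the three primary actions.
sample = [pick_action() for _ in range(10_000)]
primary = sum(1 for a in sample if a in ("add_to_cart", "remove_from_cart", "checkout"))
print(f"primary-action share: {primary / len(sample):.0%}")
```

Most load testing tools (Postman, Locust, Artillery, etc.) let you express this same idea natively as scenario weights.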
Hopefully by the time you’ve decided to run load tests, you have proactive monitoring in place. This means you have tests that run fake data through your system at regular intervals in order to catch problems before your customers see them.
There are many tools you can use for proactive monitoring, such as DataDog synthetics or scheduled Postman collection runs.
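Whatever tool you choose, the core idea is the same: push a known fake event through the system on a schedule and fail loudly if the response isn't what you expect. A sketch of that canary pattern, with the transport stubbed out so the logic stands alone (the payload shape and `confirmed` status are invented for illustration):

```python
import json

def place_test_order(api_call):
    """Run a known fake order through the system and verify the response.

    `api_call` is whatever transport you use (requests, an SDK client, ...);
    it is injected here so the canary logic itself stays testable.
    """
    payload = {"canary": True, "items": [{"sku": "TEST-SKU", "qty": 1}]}
    response = api_call("/orders", json.dumps(payload))
    body = json.loads(response)
    # A canary should fail loudly on anything other than the expected shape.
    assert body.get("status") == "confirmed", f"canary failed: {body}"
    return body

# Stubbed transport standing in for a real HTTP call.
def fake_api(path, body):
    return json.dumps({"status": "confirmed", "echo": json.loads(body)})

print(place_test_order(fake_api)["status"])  # → confirmed
```

In production you would schedule this on an interval and wire the assertion failure into your alerting.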
In my previous post about load tests, I spoke about using the AWS distributed load testing tool with Postman to run load tests. This has proven to be very successful for serverless load tests, and I continue to recommend it.
But with all the different Application Performance Monitoring (APM) tools, what if your tests aren’t in Postman?
Your options are to recreate the tests in Postman or to convert them. If you use DataDog synthetics, I have a Public Postman Workspace you can use to convert your multi-step API synthetic tests to a Postman collection.
When something goes wrong in a serverless application, the best practice is to send the event to a dead letter queue. But don’t stop there.
Once an item is in a dead letter queue, what do you do?
Your load test is going to result in some failures in your application. It should intentionally send events to an error state so you can get a gauge on how your application handles failures.
How do you handle failures in a serverless application?
There are two primary types of errors:

- Transient errors - temporary blips, like throttles or timeouts, that can succeed if retried
- Data errors - malformed or invalid input that will fail no matter how many times it is retried
Being able to track these errors separately is key to getting back on your feet quickly and easily. A transient error is, in theory, retryable: the system can take care of it itself by backing off and retrying to see if the blip has resolved.
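The back-off-and-retry loop for transient errors can be sketched in a few lines. This is a generic exponential backoff with jitter, not any particular vendor's implementation; the `TransientError` type and the flaky downstream call are stand-ins:

```python
import time
import random

class TransientError(Exception):
    """A retryable blip (throttle, timeout, 5xx)."""

def with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry `operation` on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the event for the dead letter queue
            # Full jitter: random delay up to base * 2^attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))

# Simulated downstream call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("throttled")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # → ok
```

Exhausting the retries is the signal that the event belongs in the dead letter queue rather than another retry.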
LEGO.com has a great reference on how they retry transient errors and keep track of health in their system automatically.
Data errors tend to need human interaction. Whether it's the devs on your app team or the end user, a person is needed to resolve the issue. In these instances, you need a way to alert those responsible for the fix. You could send an email, a message in Slack, an in-app notification, etc.
When these types of errors show up in the queue, you must provide a way for someone to know a problem occurred and give them a way to fix it.
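Put together, handling the dead letter queue becomes a triage step: transient failures go back for another attempt, data failures page a human. A minimal sketch, with in-memory lists standing in for a real queue and alert channel, and a made-up message shape:

```python
def triage_dlq_message(message, retry_queue, alert_channel):
    """Route a dead-letter message: transient errors go back for retry,
    data errors get a human-readable alert."""
    if message.get("error_type") == "transient":
        retry_queue.append(message)
        return "requeued"
    # Data errors need a person: email, Slack, an in-app notification, etc.
    alert_channel.append(
        f"Data error on event {message.get('event_id', '?')}: "
        f"{message.get('reason', 'unknown')}"
    )
    return "alerted"

retries, alerts = [], []
print(triage_dlq_message({"error_type": "transient", "event_id": "e1"}, retries, alerts))  # → requeued
print(triage_dlq_message({"error_type": "data", "event_id": "e2", "reason": "bad SKU"}, retries, alerts))  # → alerted
```

A load test that deliberately sends bad events through the system is exactly how you find out whether this path actually works.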
A load test doesn’t have much value if you don’t have a way to see how your system is performing. Building a dashboard for your serverless application will allow you to see how it is performing in a number of ways.
*Lambda monitoring dashboard*
Load tests help identify bottlenecks in your system, which are key stress points of your application. If you have a bottleneck somewhere, it will affect performance and scalability. Be sure to monitor SQS queue item count and time in queue to quickly identify areas that will slow your system down.
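"Time in queue" is easy to derive from message timestamps. A sketch of a health check that flags a queue as a bottleneck, assuming timestamps in the shape of SQS's `SentTimestamp` attribute (epoch milliseconds); the thresholds are illustrative and should be tuned per workload:

```python
import time

def queue_health(sent_timestamps_ms, now_ms=None, max_age_s=60, max_depth=1000):
    """Flag a queue as a bottleneck if it is too deep or messages sit too long.

    `sent_timestamps_ms` mirrors SQS's SentTimestamp attribute (epoch millis).
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    depth = len(sent_timestamps_ms)
    oldest_age_s = (
        max((now_ms - t) / 1000 for t in sent_timestamps_ms)
        if sent_timestamps_ms else 0
    )
    bottleneck = depth > max_depth or oldest_age_s > max_age_s
    return {"depth": depth, "oldest_age_s": oldest_age_s, "bottleneck": bottleneck}

now = 1_700_000_000_000
stats = queue_health([now - 90_000, now - 5_000], now_ms=now)
print(stats)  # oldest message is 90s old → bottleneck flagged
```

During a load test, watching this number climb tells you which part of the system can't keep up.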
Just like everything else in the cloud, you have options when it comes to dashboards. You can use AWS CloudWatch, DataDog, Dynatrace, or many others to build exactly what you need.
When the load test is over, you need an objective way to determine if it was a success. Enter Key Performance Indicators (KPIs).
A workload KPI is a metric you use to determine if your application is behaving as it should. These vary widely based on the nature of your application. To take the shopping cart as an example, you could have a KPI around processing a payment.
These are business metrics around the performance of your application. If your load test results in a miss on any of these, you know where you need to put your focus. Without KPIs, it is impossible to tell if your application is doing what you expect.
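Checking a load test against KPIs can be as simple as comparing results to thresholds. A sketch; the KPI names, values, and targets below are invented for the shopping cart example:

```python
def evaluate_kpis(results):
    """Compare load test results against KPI targets.

    `results` maps KPI name -> (measured value, target, higher_is_better).
    Returns the KPIs that missed, so you know where to focus.
    """
    misses = {}
    for name, (value, target, higher_is_better) in results.items():
        ok = value >= target if higher_is_better else value <= target
        if not ok:
            misses[name] = {"value": value, "target": target}
    return misses

results = {
    "payment_success_rate": (0.992, 0.995, True),  # fraction of payments processed
    "checkout_p99_latency_s": (2.1, 3.0, False),   # seconds
}
print(evaluate_kpis(results))  # → only the payment KPI missed
```

Running this after every load test gives you the objective pass/fail the section describes.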
Load tests should provide about 20% more traffic than the highest you expect. If you expect 1,000 requests a minute at peak burst times, then your load test should run about 1,200 requests a minute through the system.
If you run your load test at an impossible-to-reach volume (say, 1,000,000 requests a minute in the scenario above), you aren't going to get realistic results. Yes, you would stress test your system, but it does not need to be stressed to that capacity.
Ultimately it’s not wrong to go above and beyond, but you could be making unnecessary work for yourself. If your application starts breaking at 100,000 requests per minute but your largest expected scale is 1,000 requests per minute, you could probably spend your time better fixing other areas of your application.
Knowing the expected traffic to your application can be difficult, especially in a new application. If you sell your app on a customer by customer basis, it’s easy to know what traffic will be. But estimating traffic on a public website can turn into the wild west. You must turn to things like SEO rankings for major keywords used to find your site.
Load tests are a fun and exciting part of product launch. You get to see how your system does in a real world scenario, but still in a controlled environment. Once you fix the initial issues, consider adding a reduced load test into your CI/CD pipeline to make sure ongoing development doesn’t start throttling you somewhere.
Remember to focus on your primary business cases. You want to make sure the areas of the application that get the most use are hit the hardest. Areas of the application used for initial setup or seldom used areas do not need to be load tested.
KPIs are going to be your primary indicator of success here. If you get lots of errors in dead letter queues, but they are retried and ultimately go through the system, your KPIs might still consider that a success. Getting an error does not mean a failure in your business process. With serverless, it’s all about reliability and robustness. In fact, it is an AWS serverless design principle!
If your application can automatically recover, then it will be much more likely to handle a significant load and keep hitting your KPIs.
Best of luck with your load tests. Have fun!