Going Serverless? Build An Observability Mindset

When I started building my first serverless application, I was super excited to see how fast it started to come together. Serverless lends itself really well to rapid development, shorter iteration cycles, and finding the right product-market fit as quickly as possible.

As my team and I built the app, we naturally started putting it through its paces. We loved how fast it was, how scalable it was, and how quickly we went from idea to something real on our screens.

But that excitement didn’t last very long.

There was a bug in our code somewhere preventing us from completing a critical business process. So I was told to find out what it was and fix it. I went over to one of the engineers on my team and we began troubleshooting the issue.

We eventually found the issue was coming from a consumer of an SNS topic. It simply wasn’t running. We didn’t know if it was an issue with publishing the event or if it was an error in the consumer itself. We couldn’t tell where exactly the problem was in order to fix it.

Needless to say, this was a pretty stressful situation for me. We were building this big distributed system on an event-driven architecture, but we didn’t know how to track data through the system. It made me worry we had made the wrong call by going serverless.

Observability is hard, and it’s even harder if you don’t build with it in mind. Backing into observability after your application is up and running is a difficult task.
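To make "tracking data through the system" a little more concrete, here is a minimal sketch of one common approach: attach a correlation ID to every event and emit a structured log line at both the publish and the consume step. This is not the code from our app. The function names, the envelope shape, and the order_id field are all hypothetical, and the actual call to publish to SNS is left out so the example runs on its own.

```python
import json
import time
import uuid


def log_event(stage, correlation_id, **fields):
    """Print one structured (JSON) log line and return it."""
    record = {
        "ts": time.time(),
        "stage": stage,
        "correlation_id": correlation_id,
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line


def build_message(payload, correlation_id=None):
    """Wrap a payload in an envelope that carries a correlation ID.

    Hypothetical publisher side: in a real app the returned envelope
    would be serialized and handed to the messaging service.
    """
    correlation_id = correlation_id or str(uuid.uuid4())
    log_event("publish", correlation_id, payload_keys=sorted(payload))
    return {"correlation_id": correlation_id, "payload": payload}


def handle_message(message):
    """Hypothetical consumer: logs on entry, on failure, and on success."""
    correlation_id = message.get("correlation_id", "missing")
    log_event("consume.start", correlation_id)
    try:
        order_id = message["payload"]["order_id"]  # assumed field name
    except KeyError as err:
        log_event("consume.error", correlation_id, error=repr(err))
        raise
    log_event("consume.done", correlation_id, order_id=order_id)
    return order_id


if __name__ == "__main__":
    handle_message(build_message({"order_id": 42}))
```

With something like this in place, searching the logs for one correlation ID shows whether an event was published, whether the consumer ever started, and where it failed. Those were exactly the questions we could not answer.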

The Reason Behind Observability

I had the opportunity recently to sit down with several different observability platforms. I met with Lumigo, Baselime, Logz.io, Datadog, and a couple of early-stage startups to discuss the current state and future of observability.

Chatting with so many different vendors made me realize how important observability truly is in a serverless environment. Even though it is the main focus of the operational excellence pillar of the Well-Architected Framework for serverless, it had never really clicked with me.

It wasn’t until I had a conversation with Tomer Levy that I realized the real reason we should put so much emphasis on observability.

Observability is about hitting your SLA.

Knowing when things start to go wrong, following traces through architecture diagrams, and aggregating data out of logs all circle back to maintaining uptime and availability in your system. If you don’t have tools that point you to where and when there is a problem, you quickly begin to fall out of your SLA and give poor experiences to your customers.
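To put a number on that, here is a small sketch of the arithmetic behind an availability SLA. The 99.9% target and the 30-day period are assumptions chosen for illustration, not figures from any particular contract, and the function names are hypothetical.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of downtime an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def measured_availability(total_requests, failed_requests):
    """Request-based availability, as a percentage."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing failed
    return 100 * (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print(f"Measured: {measured_availability(1_000_000, 1_200):.2f}%")
```

Under those assumptions, the whole month's budget is roughly 43 minutes. A single incident that takes an hour to notice and diagnose uses up all of it, which is why knowing where and when a problem occurs matters so much.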

Maintaining a consistently high availability and addressing problems proactively is not a “nice to have” feature in SaaS. It’s an understood part of being successful.

Observability is a thankless component of software. Nobody notices when you do it right and respond to issues before they get reported. But if you have a blip in your service, you quickly start losing not only your customers, but your credibility as a reliable SaaS provider as well.

There are many options available at your fingertips to get started with observability. It doesn’t matter if you use an application performance monitoring (APM) tool or any of the new observability features just released in CloudWatch. The important thing is that you are observing your workload.

There’s a reason there are so many companies that specifically target APM and observability - it’s important! It is a critical part of building an application that offers a high degree of reliability to consumers.

Time and time again, I see serverless apps being built that don’t consider observability until the very end. The code is written first, and the monitoring tools are put in place afterward.

While there is nothing wrong with this approach (and we’ve all done it), it does lend itself to some less-than-ideal monitoring paradigms. When you try to put the tools in after the fact, you end up fighting them or changing code to accommodate them.

Final Thoughts

When building your application, treat monitoring and observability as a “first-class citizen”. It is a part of your architecture that is just as important as any other component.

Having strong observability practices means you can react to events in real time. Managing errors, scaling infrastructure, and tracing distributed sagas are all vital to protecting your availability SLA. High availability is no longer a nice feature in an application - it is a must-have. The right observability practices can help you manage issues before they become major problems.

There are many software vendors that offer their take on observability best practices. Rely on them to guide you to the right things to watch and how to react when things go south. Find the right tool that matches your needs and the skillset of your team.

If we don’t put an early emphasis on observability, we’ll never fix it. Start incorporating it into your workflows, designs, and patterns. As they say, the best time to start was last year. The second best time is today.

Happy coding!