I Have Good News And Bad News About Your Cloud Metrics

If you’ve ever partnered with AWS when building an application, you know that you go through an Infrastructure Event Management (IEM) plan in the weeks leading up to your go-live event. Part of the IEM is walking through all aspects of your application with key personnel from your company to determine whether you’re operationally ready for launch.

The first time I was in one of these meetings, things seemed to go really well… until they didn’t. We were deep in an Operational Readiness Review and questions started popping up about success measures.

At the time I was the development manager and running point on the project. Our AWS account manager asked me a simple question: “What metrics are you tracking that measure your success?” I gave him a deer-in-the-headlights look. I had no idea. We had never discussed what success looked like.

In my role, success looked like our infrastructure handling production workloads without faltering. That’s why we went serverless. But that’s measuring success from a technical side. It was an incomplete and short-sighted approach to success.

What I needed was an answer for how we measure success as a business. Why did we build the application to begin with? What were our product goals? How would we know when we reached them? How would we know if we were drastically off?

2023 is the year we focus on observability. But not just observability of our infrastructure. Observability of our business processes. Observability that tells us if we’re hitting our KPIs. Dead letter queue counts are useful, but they don’t tell us if our application is successful.

Establish Your KPIs

Success of your application is directly tied to your Key Performance Indicators (KPIs). KPIs are intended to help you measure progress toward strategic business goals, which oftentimes are not technical in nature.

Let’s take an example.

Take our fictional web app Gopher Holes Unlimited. This is an app for tracking two things: gophers and gopher holes. For the product owner, the primary objective is to make sure consumers know where nearby gophers and gopher holes are. We can break this down into the following SMART KPIs:

Gopher data is entered with a 90% completion rate
Users are notified of new gophers within 5 miles of their home less than 30 seconds after the gopher is added
75% of gopher holes are linked to a tracked gopher

These KPIs have nothing to do with the tech that backs the application. They are completely business driven and map directly to the objectives of the product owner. If we can hit these three measures, we consider the application a success.
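One way to keep these targets from staying abstract is to write them down as data that can be checked against measured values. The following is a minimal sketch, not a prescribed format: the KpiTarget type, the KPI names, and the meetsTarget helper are all hypothetical and exist only for illustration.

```typescript
// Hypothetical shape for a KPI target; the names are illustrative only.
type Comparison = 'atLeast' | 'lessThan';

interface KpiTarget {
  name: string;
  unit: 'percent' | 'seconds';
  comparison: Comparison;
  target: number;
}

// The three Gopher Holes Unlimited KPIs from the list above.
const kpis: KpiTarget[] = [
  { name: 'gopherDataCompletionRate', unit: 'percent', comparison: 'atLeast', target: 90 },
  { name: 'newGopherNotificationDelay', unit: 'seconds', comparison: 'lessThan', target: 30 },
  { name: 'gopherHolesLinkedToGophers', unit: 'percent', comparison: 'atLeast', target: 75 },
];

// Returns true when a measured value satisfies the KPI target.
function meetsTarget(kpi: KpiTarget, measured: number): boolean {
  return kpi.comparison === 'atLeast'
    ? measured >= kpi.target
    : measured < kpi.target;
}
```

Writing the targets down this way forces the product conversation to settle on a unit and a direction for each KPI, which is most of the work.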

Don’t Confuse Them With SLAs

Of course, you can have technical KPIs as well, but those tend to be more Service-Level Agreement (SLA) oriented. SLAs are intended to define performance and availability metrics and generally map directly to technical requirements and objectives.

The Gopher Holes Unlimited API is available 99.9% of the time
No undelivered messages are lost
All API endpoints respond in < 500ms
Less than 1% of API responses are 5XX errors

These objectives map directly to infrastructure monitoring. There are no business terms in there. They correlate with the performance of the application itself.
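To get a feel for what these numbers mean in practice, here is a small sketch of the arithmetic behind two of them. The function names are hypothetical, and your monitoring tool will normally do this math for you.

```typescript
// Percentage of API responses that were 5XX errors.
function errorRatePercent(totalResponses: number, serverErrors: number): number {
  if (totalResponses === 0) {
    return 0; // no traffic means there are no errors to report
  }
  return (serverErrors / totalResponses) * 100;
}

// Minutes of downtime an availability target allows over a period of days.
function allowedDowntimeMinutes(availabilityPercent: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - availabilityPercent / 100);
}
```

For example, a 99.9% availability target over a 30-day month leaves room for roughly 43 minutes of downtime, and 5 server errors out of 1,000 responses is a 0.5% error rate, which is inside the 1% objective.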

When building with many of the observability tools out there today, you get support for SLA metrics out-of-the-box. They are managed incredibly well and generally have invaluable insights associated with them.

Metrics like Lambda function concurrency, memory usage, and duration are all supported and relate back to our SLAs. API Gateway metrics like 5XX counts, p99 latency, and cache hit count are critically important metrics, but don’t map to business value. They are availability and performance oriented.

We’re even beginning to see some incredibly powerful dashboards being provided by vendors by default with no configuration necessary!

These are all metrics that power your SLA. They’re what we all have and use today. But bad news - they’re not enough.

How To Track Your KPIs

After my series of non-answers in the Operational Readiness Review with AWS, I gathered all our product folks in a room. I asked them the same question that was asked of me: “How do we measure success with our product?”

Initially I got the same response I had given to AWS… crickets. Not a good start.

We bounced ideas off each other but didn’t make real progress until we took a step back and asked ourselves why we built the product. Starting with why allowed us to drill down into supporting objectives. Identifying the supporting objectives opened the door to creating actionable, trackable, realistic KPIs that we all agreed were indicators of our success.

With measures of success identified, we needed a way to track the measurement. This is where custom metrics come in.

Using CloudWatch, you can publish custom metrics that track single data points and statistic sets. Data points have a range of units to choose from, allowing you to track virtually any type of business metric.
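To make the difference between a single data point and a statistic set concrete, here is a minimal sketch that builds both as plain objects. The local types mirror a subset of the fields in CloudWatch’s PutMetricData request. The helper names, the gopherNotificationDelay metric name, and the sample values are hypothetical, and nothing here calls AWS.

```typescript
// Local types mirroring a subset of the CloudWatch PutMetricData request fields.
type Unit = 'Count' | 'Seconds' | 'Milliseconds' | 'Percent';

interface StatisticSet {
  SampleCount: number;
  Sum: number;
  Minimum: number;
  Maximum: number;
}

interface MetricDatum {
  MetricName: string;
  Unit: Unit;
  Value?: number;                 // a single data point
  StatisticValues?: StatisticSet; // a pre-aggregated statistic set
}

// Hypothetical helper: one observation, published as-is.
function singleDataPoint(metricName: string, unit: Unit, value: number): MetricDatum {
  return { MetricName: metricName, Unit: unit, Value: value };
}

// Hypothetical helper: many observations summarized into one datum.
function statisticSet(metricName: string, unit: Unit, samples: number[]): MetricDatum {
  if (samples.length === 0) {
    throw new Error('A statistic set needs at least one sample');
  }
  return {
    MetricName: metricName,
    Unit: unit,
    StatisticValues: {
      SampleCount: samples.length,
      Sum: samples.reduce((total, sample) => total + sample, 0),
      Minimum: Math.min(...samples),
      Maximum: Math.max(...samples),
    },
  };
}

// Seconds between a gopher being added and a user being notified (made-up samples).
const notificationDelays = [12, 4, 20];

const putMetricDataInput = {
  Namespace: 'gopherholesunlimited',
  MetricData: [
    singleDataPoint('gopherHoles', 'Count', 1),
    statisticSet('gopherNotificationDelay', 'Seconds', notificationDelays),
  ],
};
```

An object shaped like putMetricDataInput is what you would pass to PutMetricDataCommand in the AWS SDK for JavaScript v3. A statistic set is handy when you collect many observations in one invocation and want to publish them in a single call.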

CloudWatch custom metrics can be a bit intimidating to work with in the beginning, but if you’re using JavaScript/TypeScript or Python, you’re in luck. Lambda Powertools provides an easy way to start injecting metrics into your application.

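//
// Add gopher hole function
//
// Note: MetricUnits is the export name in Powertools v1; newer major versions export MetricUnit instead
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

// Metrics are emitted under this namespace, with the service name attached as a dimension
const metrics = new Metrics({ namespace: 'gopherholesunlimited', serviceName: 'gophers' });

export const handler = async (event) => {
  const newGopherHoleData = JSON.parse(event.body);

  // Count every gopher hole submitted (the denominator of the KPI)
  metrics.addMetric('gopherHoles', MetricUnits.Count, 1);

  // Count only the holes linked to a tracked gopher (the numerator of the KPI)
  if (newGopherHoleData.gopherId) {
    metrics.addMetric('gopherHolesLinkedToGophers', MetricUnits.Count, 1);
  }

  // Logic to add a new gopher hole goes here

  // Flush the buffered metrics so they are emitted before the invocation ends
  metrics.publishStoredMetrics();
}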

The metrics above support the KPI that 75% of gopher holes are linked to a tracked gopher: the first counts every hole, the second counts only the linked ones, and the ratio of the two is the KPI. Notice they have no relation to any infrastructure measure at all. They are completely custom and written into the business logic of the application.

Once your business-related metrics are tracked, you can build dashboards and alarms around them. Build views that consolidate your objectives into a single place, making them visible at a glance.
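As an illustration, here is a sketch of what an alarm on the linked gopher holes KPI could look like. It builds a plain object whose fields mirror a subset of CloudWatch’s PutMetricAlarm request and uses a metric math expression to turn the two counters from the function above into a percentage. The alarm name, the one-hour period, and the helper are hypothetical. The service dimension is an assumption about how Powertools publishes the metrics, so verify it against what you actually see in CloudWatch.

```typescript
// Local types mirroring a subset of the CloudWatch PutMetricAlarm request fields.
interface MetricStat {
  Metric: {
    Namespace: string;
    MetricName: string;
    Dimensions: { Name: string; Value: string }[];
  };
  Period: number; // seconds
  Stat: string;
}

interface MetricDataQuery {
  Id: string;
  MetricStat?: MetricStat;
  Expression?: string;
  Label?: string;
  ReturnData: boolean;
}

interface AlarmDefinition {
  AlarmName: string;
  ComparisonOperator: 'LessThanThreshold' | 'GreaterThanThreshold';
  Threshold: number;
  EvaluationPeriods: number;
  TreatMissingData: 'missing' | 'notBreaching' | 'breaching' | 'ignore';
  Metrics: MetricDataQuery[];
}

const NAMESPACE = 'gopherholesunlimited';
const ONE_HOUR_SECONDS = 3600;

// Hypothetical helper: sums one of the custom counters over the period.
function counter(id: string, metricName: string): MetricDataQuery {
  return {
    Id: id,
    MetricStat: {
      Metric: {
        Namespace: NAMESPACE,
        MetricName: metricName,
        // Assumption: the metrics carry a service dimension matching serviceName.
        Dimensions: [{ Name: 'service', Value: 'gophers' }],
      },
      Period: ONE_HOUR_SECONDS,
      Stat: 'Sum',
    },
    ReturnData: false, // inputs to the expression, not what the alarm watches
  };
}

const linkedHolesAlarm: AlarmDefinition = {
  AlarmName: 'gopher-holes-linked-below-target', // hypothetical name
  ComparisonOperator: 'LessThanThreshold',
  Threshold: 75, // the KPI target, in percent
  EvaluationPeriods: 1,
  TreatMissingData: 'notBreaching', // no new holes in a period is not a failure
  Metrics: [
    counter('total', 'gopherHoles'),
    counter('linked', 'gopherHolesLinkedToGophers'),
    {
      Id: 'linkedPercent',
      Expression: '100 * linked / total',
      Label: 'Gopher holes linked to a gopher (%)',
      ReturnData: true, // the alarm evaluates this value
    },
  ],
};
```

An object like this is what you would pass to PutMetricAlarmCommand in the AWS SDK for JavaScript v3, or translate into your infrastructure-as-code tool of choice. The same expression can back a dashboard widget so the alarm and the view stay in sync.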

It’s one thing to simply have dashboards. It’s another thing entirely to actually use them.

Remember, these KPIs are your measures of success. Once they are created they need to become a priority. Bring them up in your daily stand-ups. Reference them in your one-on-ones. Whatever you do, pay attention to them!

Final Thoughts

I learned the hard way that infrastructure metrics aren’t the only thing you need to be monitoring in your application. They are great for load tests and tracking your SLA, but they aren’t great at measuring success as a business.

One could argue maintaining your SLA is an indicator of success, and I partially agree. You won’t be successful if your availability and performance drop below a certain threshold. However, you also won’t be successful if your only focus is availability.

You need to focus on the business value that keeps your consumers happy. Optimizing around user experience and proficiency will set you apart. You’ll excel above your competitors in the market because you are driven by business metrics and you make them a priority as you assess what’s next.

If you haven’t already, get your product and dev team together to talk about success. How would you objectively track what success looks like? Build custom metrics into your application code and surface the business value to everyone involved. Show meaningful indicators to your stakeholders. Track metric progress over time to make sure you’re constantly improving.

Ideally you will determine your measures of success prior to building the application. But if you’ve already started or, in my case, are about to go live, it’s never too late to identify and implement them.

So, good news - the metrics you’re already tracking are useful and help you provide consistent performance and availability. Bad news - they don’t make up the full picture.

Happy coding!