Ready, Set, Cloud!

🦸 Community Superhero

Our community superhero this week is Marc Campora, founder and president of LAZGAR and AWS Community Builder. Marc has a sharp focus on helping organizations through digital transformation with strong architecture and strategic guidance. He has a wealth of experience combining business goals with technology, and is always thoughtful in how he shares that expertise. Marc has the kind of presence that helps everyone around him think more clearly. Thank you for everything you do, Marc!

💯 Spotlight

Of course this week’s spotlight is going to address the elephant in the room - the internet outage last week caused by a DNS automation bug inside DynamoDB. In case you were offline on Monday - pretty much nothing worked. The AWS us-east-1 region had major service degradation three times over a ~14 hour period. I recommend reading the write-up from AWS on how issues cascaded through multiple services, but I’ll explain the root cause here. I find it fascinating (and validating) because it shows that race conditions can happen to anyone, anywhere, at any time.

DynamoDB manages hundreds of thousands of DNS records for its internal load balancers. It has two systems that manage these records: the DNS Planner and the DNS Enactor. There are three Enactors running in different AZs to maximize reliability of the service. On Monday, a race condition occurred:

Enactor A was applying an old plan but got delayed (it happens).
Meanwhile, Enactor B quickly applied a newer plan and then deleted old ones.
Enactor A finally finished and overwrote the newer plan with its old one.
Immediately afterward, Enactor B’s cleanup deleted that now-active (old) plan.
This resulted in the regional DynamoDB DNS record becoming empty, effectively deleting dynamodb.us-east-1.amazonaws.com
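
If it helps to see that sequence as code, here is a toy model of the steps above. It is only a sketch for building intuition, not how DynamoDB is actually implemented: the `DnsState` class, the plan version numbers, the IP addresses, and the `guarded` flag are all made up for illustration. The `guarded` option shows one generic way to defend against this kind of race (refusing to apply a plan older than the active one), and I'm not claiming it is the fix AWS chose.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DnsState:
    """Toy stand-in for the regional endpoint's DNS configuration (hypothetical)."""
    plans: dict[int, list[str]] = field(default_factory=dict)  # plan version -> IPs
    active: int | None = None  # plan version the endpoint currently points at

    def resolve(self) -> list[str]:
        # If the active plan has been deleted, the endpoint resolves to nothing.
        return self.plans.get(self.active, [])


def apply_plan(state: DnsState, version: int, ips: list[str], guarded: bool = False) -> bool:
    """Apply a plan. With guarded=True, refuse to replace a newer active plan."""
    if guarded and state.active is not None and version < state.active:
        return False  # stale plan rejected
    state.plans[version] = ips
    state.active = version
    return True


def cleanup(state: DnsState, keep_from: int) -> None:
    """Delete every plan older than keep_from, without checking which one is active."""
    for version in [v for v in state.plans if v < keep_from]:
        del state.plans[version]


def run(guarded: bool) -> list[str]:
    state = DnsState()
    # Enactor A has picked up plan 1 but is delayed before applying it.
    apply_plan(state, 2, ["10.0.0.2"], guarded)  # Enactor B applies the newer plan
    apply_plan(state, 1, ["10.0.0.1"], guarded)  # Enactor A finishes late with the old plan
    cleanup(state, keep_from=2)                  # Enactor B's cleanup removes old plans
    return state.resolve()


if __name__ == "__main__":
    print("unguarded:", run(guarded=False))
    print("guarded:  ", run(guarded=True))
```

Without the guard, the late write makes plan 1 active again and the cleanup then deletes it, so the endpoint resolves to an empty list. With the guard, the stale write is rejected and plan 2 survives the cleanup.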

It’s always DNS, am I right?! 🙃

🔥 My Favorite Content

I found an article from Ian Brumby last week on a topic near and dear to my heart - real-time notifications. Ian wrote about using EventBridge Pipes and AppSync Events to deliver updates in real time to his web app. He has a fun use case with plane routes and how they are affected by hazards, but what you really need to take away from it is the pattern. Streaming events from DynamoDB to your message broker through Pipes is powerful and easily extensible. He even shares the source code so you can try it yourself.

You know what’s difficult? Designing event-driven architectures. You know what’s more difficult? Growing and evolving them responsibly and scalably. I’ve personally found myself in a choreographed corner of shame more than once because I didn’t follow best practices. But what are the best practices? James Eastham shares with us what they are and how to grow your EDA simply in his video from last week. In classic James fashion, the video is well-explained, talks through code (TypeScript!!), and covers some hard topics with event-driven systems.

Andres Moreno and I continued our livestreaming series on enterprise AI agent design principles. This time, we focused on data validation for both inputs and outputs. We discussed the reasons why you need both, covering the different types of attack vectors against LLMs. This session ran through an actual (mostly) implemented use case of an autonomous AI agent, bringing real-world practice alongside the theory. TL;DR - if you aren’t using guardrails, start today! If you want to try out our agent-building test harness for free, it is available on GitHub.

I’ve been noticing a trend in security-rooted articles recently: defaults are dangerous. Dave Hall posted a blog last week on immutable container image tags that had this as the underlying message. Going with the default latest tag on your container images really isn’t a best practice, but it’s an easy default. Dave does a good job in his article making a case for immutable image tags and even shows how you can do it with a handful of different providers. I honestly had never thought about this before, but I’ve been convinced to stay away from latest.
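
To make the "stay away from latest" point concrete, here is a small lint-style sketch that flags image references that fall back to the latest tag. This is my own illustration, not something taken from Dave's article, and it only covers the reference string itself, not the registry-side immutability settings he walks through. The function names and the example registry host are hypothetical.

```python
from __future__ import annotations

import re

# A reference pinned by digest ends with "@sha256:" plus 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def tag_of(image: str) -> str | None:
    """Return the tag of an image reference.

    Returns "latest" when no tag is given (the default most tooling assumes)
    and None when the reference is pinned by digest.
    """
    if DIGEST_RE.search(image):
        return None
    # Only look after the last "/" so a registry port like ":5000" is not
    # mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" in name:
        return name.rsplit(":", 1)[1]
    return "latest"


def uses_latest(image: str) -> bool:
    """True when the reference resolves to the mutable latest tag."""
    return tag_of(image) == "latest"


if __name__ == "__main__":
    for ref in ["nginx", "nginx:latest", "registry.example.com:5000/app:1.4.2"]:
        print(ref, "->", "uses latest" if uses_latest(ref) else "pinned to a tag")
```

A check like this could run in CI against your deployment manifests, though an explicit version tag is only as trustworthy as the registry's guarantee that it can't be overwritten.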

Cloudflare always does a phenomenal job on their blogs - and their post on secure agentic commerce is no exception. Agentic shopping is coming quickly, and it’s just so interesting to see how big providers like Cloudflare, Visa, and Mastercard are quickly preparing themselves against malicious bots and fraud. I really like where this is headed.

💡 Tip of the Week

I know lots of you reading this newsletter are either content creators or aspiring to be. One of the understated components of effective content creation is social media marketing (like it or not). Our friend Johannes Koch has been working on a tool to help make that easier and more reliable so you can grow your audience faster than ever. He’s currently looking for early adopters to try it out and give him feedback.

🐣 New Releases

Amazon CloudWatch released interactive incident reporting, which builds post-incident analysis reports for you automatically. It gathers metrics and data from around your account, compiles them, and gives you insights in various formats. Pretty cool!

Aurora DSQL is now available in the Frankfurt region in Europe. It’s nice to see this growing to new regions.

Last Words

There were a lot of mixed and interesting signals on social media last week following the AWS outage. I don’t really agree with running a huge sales campaign on the heels of an outage, and I saw plenty of that on LinkedIn. We should be empathetic in situations like this, not turn our backs and say “this is why you need multi-cloud, here’s your out.” I saw lots of opinions similar to mine, which I’m happy to see, but I’m curious if you share the same thoughts.

That’s my take on the week, but what’s yours?

What did I miss? What made you nod along (or 🙄)? Hit reply if you’re reading the email. Prefer socials? Ping me on Twitter, LinkedIn, or email.

Happy coding!

Allen