Posts tagged “engineering”

23 March 2026

Agentic manual testing

Simon Willison has a practical guide on manual testing with coding agents. Two tips I’ve already started using:

It’s still quick for an agent to write out a demo file and then compile and run it. I sometimes encourage it to use /tmp purely to avoid those files being accidentally committed to the repository later on.

And:

If an agent finds something that doesn’t work through their manual testing, I like to tell them to fix it with red/green TDD. This ensures the new case ends up covered by the permanent automated tests.

14 March 2026

AI should help us produce better code

As usual, Simon Willison hits the nail on the head here:

If adopting coding agents demonstrably reduces the quality of the code and features you are producing, you should address that problem directly: figure out which aspects of your process are hurting the quality of your output and fix them. Shipping worse code with agents is a choice. We can choose to ship code that is better instead.

Also see Mitchell Hashimoto’s idea of “harness engineering”:

It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.

6 March 2026

Why It's Still Valuable To Learn To Code

Carson Gross has a good essay on whether junior programmers should still learn to code given how capable AI has become. His core warning to students:

Yes, AI can generate the code for this assignment. Don’t let it. You have to write the code. I explain that, if they don’t write the code, they will not be able to effectively read the code. The ability to read code is certainly going to be valuable, maybe more valuable, in an AI-based coding future. If you can’t read the code you are going to fall into The Sorcerer’s Apprentice Trap, creating systems you don’t understand and can’t control.

And on what separates senior engineers who can use AI well from those who can’t:

Senior programmers who already have a lot of experience from the pre-AI era are in a good spot to use LLMs effectively: they know what ‘good’ code looks like, they have experience with building larger systems and know what matters and what doesn’t. The danger with senior programmers is that they stop programming entirely and start suffering from brain rot.

This maps directly onto what I’ve been writing about with AI for product work and the second brain setup I’ve built. The system works because I spent years writing and reading PRDs, strategy docs, and OKRs—enough to develop actual opinions about what good looks like. You have to do the work first, then the second brain is worth building.

1 March 2026

Toolshed, blueprints, and why good agents need good DevEx

Alistair Gray published part two of Stripe’s “Minions” series, going deeper on how they built their internal coding agents. It’s a great read throughout, but three ideas really stood out to me.

First, blueprints. These are workflows that mix deterministic steps with agentic ones:

Blueprints are workflows defined in code that direct a minion run. Blueprints combine the determinism of workflows with agents’ flexibility in dealing with the unknown: a given node can run either deterministic code or an agent loop focused on a task. In essence, a blueprint is like a collection of agent skills interwoven with deterministic code so that particular subtasks can be handled most appropriately.

If you know a step should always happen the same way, don’t let an LLM decide how to do it. Let the agent handle the ambiguous parts, and hardcode the rest (this can also dramatically reduce token cost).

Second, their centralized MCP server:

We built a centralized internal MCP server called Toolshed, which makes it easy for Stripe engineers to author new tools and make them automatically discoverable to our agentic systems. All our agentic systems are able to use Toolshed as a shared capability layer; adding a tool to Toolshed immediately grants capabilities to our whole fleet of hundreds of different agents.

A shared tool layer that all agents can use… 500 tools, one server, hundreds of agents. Very cool idea.

And third, what they call “shifting feedback left”:

We have pre-push hooks to fix the most common lint issues. A background daemon precomputes lint rule heuristics that apply to a change and caches the results of running those lints, so developers can usually get lint fixes in well under a second on a push.

If you can catch a problem before it hits CI, do it there. A sub-second lint fix on push is better than a 10-minute CI failure, whether you’re a person or an LLM burning tokens.

So much of Stripe’s agent success is built on top of investments they made for human developer productivity. Good dev environments, fast feedback loops, shared tooling. The agents benefit from all of it, and developers remain in control.

21 February 2026

The A.I. Disruption Has Arrived, and It Sure Is Fun

Paul Ford writes about vibe coding for the NYT (gift link) and what happens when software suddenly becomes cheap and fast to ship:

There are many arguments against vibe coding through A.I. It is an ecological disaster, with data centers consuming billions of gallons of water for cooling each year; it can generate bad, insecure code; it creates cookie-cutter apps instead of real, thoughtful solutions; the real value is in people, not software. All of these are true and valid. But I’ve been around too long. The web wasn’t “real” software until it was. Blogging wasn’t publishing. Big, serious companies weren’t going to migrate to the cloud, and then one day they did.

And then he brings it home in a way that continues to make him one of my favorite web writers:

The simple truth is that I am less valuable than I used to be. It stings to be made obsolete, but it’s fun to code on the train, too. And if this technology keeps improving, then everyone who tells me how hard it is to make a report, place an order, upgrade an app or update a record — they could get the software they deserve, too. That might be a good trade, long term.

We can grieve what we lost, while also being optimistic about the future AI is unlocking for all of us. It’s uncomfortable, but that’s ok, all technological shifts are.

25 January 2026

Why "Correction of Error" Gets Incidents (and Product Failures) Wrong

I’ve covered “root cause” thinking in incident reviews before, and Lorin Hochstein takes aim at a related issue: AWS’s “Correction of Error” terminology:

I hate the term “Correction of Error” because it implies that incidents occur as a result of errors. As a consequence, it suggests that the function of a post-incident review process is to identify the errors that occurred and to fix them. I think this view of incidents is wrong, and dangerously so: It limits the benefits we can get out of an incident review process.

What makes his critique compelling is the observation that production systems are full of defects that never cause outages:

If your system is currently up (which I bet it is), and if your system currently has multiple undetected defects in it (which I also bet it does), then it cannot be the case that defects are a sufficient condition for incidents to occur. In other words, defects alone can’t explain incidents.

This applies to product work too. When users report problems, our instinct is to find “the bug” and fix it. But often the bug has been there for months—what changed is the context around it. A new user flow, a spike in traffic, a feature interaction we didn’t anticipate. If we stop at “fixed the bug,” we miss the chance to understand why the system let that failure through in the first place.

23 January 2026

The invisible 40% of engineering work

Anton Zaides wrote a good post about shadow work in engineering teams. He discovered a senior engineer on his team was spending over 40% of his time on work that didn’t show up anywhere—code reviews, mentoring, ad-hoc support fixes, etc.

This part is important:

The shadow backlog isn’t the problem—in my opinion, that’s probably the work that should have been done in the first place. The solution is to stop doing it under the table and make sure you have space for it. The more people don’t agree with your roadmap because it was decided for them, the more shadow backlog you’ll have.

The shadow backlog is a symptom of a roadmap that doesn’t reflect reality—and that often happens when engineering teams are not involved in planning and prioritization. That is the real fix—making sure everyone understands and is aligned on the roadmap, and making sure this kind of BAU (Business As Usual) work is visible and planned for.

30 December 2025

Humans make mistakes, and so does AI. It's fine.

Will Larson has a good post about implementing “Agent Skills” in their internal agent framework at Imprint. The whole piece is worth reading, but I wanted to highlight this observation:

Humans make mistakes all the time. For example, I’ve seen many dozens of JIRA tickets from humans that don’t explain the actual problem they are having. People are used to that, and when a human makes a mistake, they blame the human. However, when agents make a mistake, a surprising percentage of people view it as a fundamental limitation of agents as a category, rather than thinking that, “Oh, I should go update that prompt.”

There’s a double standard at play here that I’ve noticed too. When a colleague writes a confusing document, we ask them to clarify. When an agent produces something off, we tend to smirk and declare the technology fundamentally broken.

The fix is often as simple as updating a prompt—the same way you’d coach a team member to write better tickets. Skills, in Larson’s implementation, are essentially reusable prompt snippets that encode learned behaviors across workflows. It’s the kind of organizational knowledge we build up with people over time, just made explicit.

16 December 2025

Building MCP servers in the real world

This has been my experience with MCP servers as well. As useful as I think my Last.fm MCP server is, I can’t see it every having more than a dozen users. But internal company servers are massively useful:

MCP is being used especially heavily by internal data and platform teams to give internal users access to systems. These are systems that these users perhaps already had access to, but it was either too complex or too broad, or needed a lot of documentation or special skills to use.

Wiki search is so much better now that I can use our internal MCP server for it via Windsurf.

Source: Building MCP servers in the real world →

16 December 2025

Measuring AI's Impact on Shipping Speed and Code Quality

Will Larson has a good post about how they’re adopting AI at his company. The process is interesting, but this is the part that jumped out at me:

My biggest fear for AI adoption is that they can focus on creating the impression of adopting AI, rather than focusing on creating additional productivity. Optics are a core part of any work, but almost all interesting work occurs where optics and reality intersect.

It’s really hard to figure out if AI tools are (1) helping teams ship faster (2) without sacrificing quality.

We’re working on figuring out this problem right now at Cloudflare. Our proposed approach sidesteps the problem of per-commit AI attribution (did Copilot write this line? did Claude?) by correlating team-level AI tool usage with team-level health metrics over time. If a team’s AI adoption increases by 30% and their change failure rate stays stable, that’s a useful signal. If AI usage spikes and incidents start trending up, that’s worth investigating.

The key insight is that you don’t need perfect attribution to get directionally useful data. Correlation isn’t causation, and teams adopting AI tools may already be more experimental or higher-performing. But at least you’re measuring something real instead of the something like “# of lines written by AI”, which leads straight to the Goodhart’s Law problem where metrics become targets.