Toolshed, blueprints, and why good agents need good DevEx

Alistair Gray published part two of Stripe’s “Minions” series, going deeper on how they built their internal coding agents. It’s a great read throughout, but three ideas really stood out to me.

First, blueprints. These are workflows that mix deterministic steps with agentic ones:

Blueprints are workflows defined in code that direct a minion run. Blueprints combine the determinism of workflows with agents’ flexibility in dealing with the unknown: a given node can run either deterministic code or an agent loop focused on a task. In essence, a blueprint is like a collection of agent skills interwoven with deterministic code so that particular subtasks can be handled most appropriately.

If you know a step should always happen the same way, don’t let an LLM decide how to do it. Let the agent handle the ambiguous parts, and hardcode the rest (this can also dramatically reduce token cost).
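To make the split concrete, here is a minimal Python sketch of a blueprint as an ordered list of nodes, each of which is either plain code or an agent loop. This is not Stripe's implementation: `Node`, `run_blueprint`, and the stubbed `fake_agent` are hypothetical names invented for illustration, and a real agent node would call a model instead of returning a canned patch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]

@dataclass
class Node:
    name: str
    run: Callable[[State], State]
    agentic: bool = False  # True when the step delegates to an agent loop

def create_branch(state: State) -> State:
    # Deterministic: always happens the same way, so no LLM is involved.
    return {**state, "branch": f"minion/{state['ticket']}"}

def fake_agent(state: State) -> State:
    # Stand-in for an agent loop (assumption); a real one would call a model.
    return {**state, "patch": f"fix for {state['ticket']}"}

def run_linters(state: State) -> State:
    # Deterministic again: a hardcoded check rather than an LLM decision.
    return {**state, "lint": "ok" if state.get("patch") else "missing patch"}

def run_blueprint(nodes: List[Node], state: State) -> State:
    # Run each node in order and record which kind of step handled it.
    trace = []
    for node in nodes:
        state = node.run(state)
        trace.append(("agent" if node.agentic else "code") + ":" + node.name)
    return {**state, "trace": " > ".join(trace)}

BLUEPRINT = [
    Node("create_branch", create_branch),
    Node("implement_change", fake_agent, agentic=True),
    Node("run_linters", run_linters),
]

result = run_blueprint(BLUEPRINT, {"ticket": "T-123"})
print(result["trace"])
```

Only the middle node would spend tokens in a real system; the steps on either side are ordinary functions.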

Second, their centralized MCP server:

We built a centralized internal MCP server called Toolshed, which makes it easy for Stripe engineers to author new tools and make them automatically discoverable to our agentic systems. All our agentic systems are able to use Toolshed as a shared capability layer; adding a tool to Toolshed immediately grants capabilities to our whole fleet of hundreds of different agents.

A shared tool layer that all agents can use… 500 tools, one server, hundreds of agents. Very cool idea.
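One way to picture a shared capability layer is a single registry that every agent queries at call time. The Python sketch below is a toy under stated assumptions, not Toolshed and not the MCP protocol: `ToolRegistry`, `register`, `list_tools`, and `Agent` are all hypothetical names.

```python
from typing import Callable, Dict, List

class ToolRegistry:
    """Toy stand-in for a central tool server (hypothetical, not MCP)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self._docs: Dict[str, str] = {}

    def register(self, name: str, description: str):
        # Decorator: authoring a tool and publishing it are one step.
        def wrap(fn: Callable[..., str]) -> Callable[..., str]:
            self._tools[name] = fn
            self._docs[name] = description
            return fn
        return wrap

    def list_tools(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, **kwargs) -> str:
        return self._tools[name](**kwargs)

registry = ToolRegistry()

class Agent:
    def __init__(self, name: str, tools: ToolRegistry) -> None:
        self.name = name
        self.tools = tools  # every agent holds the same shared registry

agents = [Agent("reviewer", registry), Agent("migrator", registry)]

@registry.register("lookup_owner", "Find the team that owns a service")
def lookup_owner(service: str) -> str:
    return f"team-{service}"

# The tool was added after the agents were created, yet both can see it.
for agent in agents:
    print(agent.name, agent.tools.list_tools())
```

Because agents hold a reference to the registry rather than a copy of its contents, adding one tool reaches all of them without any per-agent change.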

And third, what they call “shifting feedback left”:

We have pre-push hooks to fix the most common lint issues. A background daemon precomputes lint rule heuristics that apply to a change and caches the results of running those lints, so developers can usually get lint fixes in well under a second on a push.

If you can catch a problem before it hits CI, do it there. A sub-second lint fix on push is better than a 10-minute CI failure, whether you’re a person or an LLM burning tokens.
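The caching idea in the quote can be sketched as a lookup keyed by file content, so a lint result computed in the background is reused at push time. This is an illustrative guess at how such a cache might work, not a description of Stripe's daemon; `LintCache` and the single `slow_lint` rule are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List

def slow_lint(source: str) -> List[str]:
    # Hypothetical lint rule: flag lines with trailing whitespace.
    return [
        f"line {i}: trailing whitespace"
        for i, line in enumerate(source.splitlines(), start=1)
        if line != line.rstrip()
    ]

class LintCache:
    def __init__(self, lint: Callable[[str], List[str]]) -> None:
        self._lint = lint
        self._results: Dict[str, List[str]] = {}
        self.misses = 0  # how many times the real linter actually ran

    def _key(self, source: str) -> str:
        # Identical content always maps to the same cache entry.
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def check(self, source: str) -> List[str]:
        key = self._key(source)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._lint(source)
        return self._results[key]

cache = LintCache(slow_lint)
change = "x = 1 \ny = 2\n"

cache.check(change)           # background pass warms the cache
issues = cache.check(change)  # push-time check reuses the stored result
print(issues, cache.misses)
```

The second call never runs the linter, which is the property that makes a sub-second response on push plausible.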

So much of Stripe’s agent success is built on top of investments they made for human developer productivity. Good dev environments, fast feedback loops, shared tooling. The agents benefit from all of it, and developers remain in control.

23 February 2026

Project Brains: Organizing Complex Initiatives for AI-Assisted Work

I’ve written before about how I use AI for product work and how that workflow evolved with slash commands and skills. This post focuses on how to maintain context for complex, long-running projects.

The Problem: Context Fragmentation

When I’m working on a major initiative, relevant information ends up scattered everywhere: PRDs in one tool, tickets in another, meeting notes in a third, plus emails and chat threads. Every time I return to a project after a few days, I spend time reconstructing where things stand.

AI assistants can make this worse because each conversation starts fresh. I can reference files, but the model doesn’t know which files matter for this project, what decisions we’ve already made, or what questions remain open. I end up re-explaining context that should be obvious.

Project brains solve this by creating a dedicated folder for each major initiative with a standard structure that both humans and AI can navigate.

What a Project Brain Looks Like

The structure looks like this:

projects/[project-name]/
├── CONTEXT.md        # The hub: status, stakeholders, decisions, open questions
├── artifacts/        # PRDs, specs, designs, one-pagers
├── decisions/        # Decision logs with rationale and alternatives
├── research/         # Customer feedback, data analysis, technical investigation
└── meetings/         # Meeting notes related to this project

The CONTEXT.md file is a living document that answers the questions I’d need to answer every time I pick up a project:

What’s the current status?
Who are the stakeholders and what do they care about?
What decisions have we made and why?
What questions are still open?
Where are the relevant artifacts?

When I start a conversation about a project, I point the AI to the project folder. It reads CONTEXT.md first, then can drill into specific artifacts as needed. The model immediately knows the project state without me explaining it.

A Real Example

Say I’m working on adding observability to an internal platform—something that needs coordination across multiple teams over several months. The CONTEXT.md includes:

Quick reference table: Status, PM, engineering lead, target dates, links to the PRD and relevant tickets. Everything I’d need to orient myself.
Problem statement: A clear articulation of the user pain. In this case: “Platform incidents go undetected until users report them, and debugging takes hours due to lack of visibility.”
Success metrics with baselines and targets: Things like uptime targets, reduction in mean time to resolution, and alert accuracy. These anchor every conversation about scope.
Key decisions made: A table showing what was decided, when, why, and what alternatives we considered. When someone asks “why aren’t we including component X in v1?”, the answer is already documented.
Open questions: A checklist of unresolved issues. This prevents the AI from assuming things are settled when they’re not.
Links: Direct paths to the PRD, spec, analysis docs, and related pages.

The decisions/ folder contains detailed decision logs for significant choices. The research/ folder holds whatever analysis informed the project direction. The meetings/ folder captures sync notes that would otherwise disappear into Gemini notes in a Google Drive… somewhere.

When to Create a Project Brain

Not every task needs this treatment. I create a project brain when:

The work spans multiple weeks or months. Short-term tasks don’t need the overhead.
Multiple stakeholders are involved. If I need to coordinate with other teams, having a single source of context helps.
Decisions require documented rationale. If someone might ask “why did you do it this way?” later, a decision log is worth the investment.
The project crosses team boundaries. Cross-functional initiatives benefit from dedicated context that doesn’t live in any one team’s space.

For simpler work, I use a flatter folder structure with documents organized by type. Project brains are for the complex initiatives where losing the thread between sessions costs me real time.

How AI Uses Project Brains

The structure earns its keep when I’m working with AI on project-specific tasks. A few examples:

Preparing for a meeting: “Read the CONTEXT.md in the [project] folder. I have a spec review meeting tomorrow. What are the open questions I should raise?”
Drafting an update: “Based on the project context, draft a status update for leadership. Focus on progress since the start of the month and remaining blockers.”
Decision analysis: “We need to decide whether to include [component] in scope. Read the research folder and the current CONTEXT.md. What would you recommend and why?”

Because the AI has read the project brain before I ask anything, it already knows the project’s history and the people involved, so its recommendations fit this specific situation instead of falling back on generic best practices.

Maintaining the Project Brain

The value depends on keeping CONTEXT.md current. I’ve found a few practices help:

Update after significant events. When a decision is made, a meeting happens, or the status changes, update the file immediately. “I’ll do it later” means it won’t happen. LLMs are great at making these updates, so you can simply say “update relevant files based on the session we just concluded.”
Move open questions to resolved. When a question gets answered, don’t delete it. Mark it resolved and note the answer. This preserves the reasoning trail.
Link, don’t duplicate. CONTEXT.md should point to artifacts, not contain them. Keep PRDs in the artifacts folder. Keep meeting notes in the meetings folder. The context file is a hub, not a repository.
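The "mark it resolved, don't delete it" practice is mechanical enough to script. As a minimal sketch, assuming open questions are stored as Markdown checkboxes (`- [ ]`), which is a guess about the file format, a hypothetical `resolve_question` helper could look like this:

```python
def resolve_question(context_md: str, question: str, answer: str) -> str:
    """Tick off one open question and keep the answer next to it."""
    open_item = f"- [ ] {question}"
    resolved_item = f"- [x] {question} (Resolved: {answer})"
    if open_item not in context_md:
        raise ValueError(f"open question not found: {question!r}")
    # Replace only the first match so other checklist items are untouched.
    return context_md.replace(open_item, resolved_item, 1)

doc = (
    "## Open Questions\n"
    "- [ ] Include component X in v1?\n"
    "- [ ] Who owns alerting?\n"
)
updated = resolve_question(doc, "Who owns alerting?", "Platform team")
print(updated)
```

The question text survives alongside its answer, so the reasoning trail stays in the file.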

Scaffolding New Projects

I have a slash command that scaffolds new project brains:

/new-project platform-observability

This creates the folder structure, generates a CONTEXT.md from a template, and fills out a rough draft based on whatever context I provide. Removing the friction of setup means I’m more likely to actually use the system. You can view the command here.

The template includes the standard sections (Quick Reference, Problem Statement, Success Metrics, etc.) with placeholder text. I fill in what I know and mark other sections as TBD. Even an incomplete project brain is more useful than scattered notes.
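For anyone without a slash-command setup, the scaffolding step can be approximated with a short script. This is a rough Python sketch rather than the actual command: the folder names come from the structure described earlier, while `scaffold_project` and the template wording are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

SUBFOLDERS = ["artifacts", "decisions", "research", "meetings"]

# Placeholder template; section names follow the post, wording is assumed.
CONTEXT_TEMPLATE = """# {name}

## Quick Reference
TBD

## Problem Statement
TBD

## Success Metrics
TBD

## Key Decisions
TBD

## Open Questions
- [ ] TBD

## Links
TBD
"""

def scaffold_project(root: Path, name: str) -> Path:
    project = root / "projects" / name
    for sub in SUBFOLDERS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    context = project / "CONTEXT.md"
    if not context.exists():  # never overwrite an existing hub file
        context.write_text(CONTEXT_TEMPLATE.format(name=name), encoding="utf-8")
    return project

# Demo in a throwaway directory so nothing real is touched.
demo_root = Path(tempfile.mkdtemp())
project = scaffold_project(demo_root, "platform-observability")
print(sorted(p.name for p in project.iterdir()))
```

Running it twice is safe: existing folders are left alone and an existing CONTEXT.md is not replaced.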

What Surprised Me

A well-organized project brain with sparse content beats a folder full of undifferentiated documents every time, because the AI (and future me) can work with structure far more easily than with a pile of files. The decision logs have paid off the most: when someone asks why we didn’t do something, I point to the log instead of reconstructing my reasoning from memory. And while I built this for the AI, I reference these files constantly myself; keeping the context current orients me as much as it does the assistant. The structure also stays flexible: some projects grow extra subfolders like research/customer-interviews/, while others need fewer.

This approach requires discipline to maintain, and the upfront setup takes time. But for complex initiatives where context fragmentation is a real problem, project brains have been worth the investment. The AI becomes a more useful collaborator when it has access to the same context I do.

I’m still iterating on the structure. I suspect the template will look different six months from now as I learn what sections actually get used and which ones I skip every time. I’m not trying to get the folder structure perfect. I just want to stop losing context between conversations, so each time I come back to a project I can build on what I already know.

21 February 2026

The A.I. Disruption Has Arrived, and It Sure Is Fun

Paul Ford writes for the NYT about vibe coding and what happens when software suddenly becomes cheap and fast to ship:

There are many arguments against vibe coding through A.I. It is an ecological disaster, with data centers consuming billions of gallons of water for cooling each year; it can generate bad, insecure code; it creates cookie-cutter apps instead of real, thoughtful solutions; the real value is in people, not software. All of these are true and valid. But I’ve been around too long. The web wasn’t “real” software until it was. Blogging wasn’t publishing. Big, serious companies weren’t going to migrate to the cloud, and then one day they did.

And then he brings it home in a way that continues to make him one of my favorite web writers:

The simple truth is that I am less valuable than I used to be. It stings to be made obsolete, but it’s fun to code on the train, too. And if this technology keeps improving, then everyone who tells me how hard it is to make a report, place an order, upgrade an app or update a record — they could get the software they deserve, too. That might be a good trade, long term.

We can grieve what we lost while also being optimistic about the future AI is unlocking for all of us. It’s uncomfortable, but that’s OK; all technological shifts are.

21 February 2026

The Father-Daughter Divide

Isabel Woodford has a research-heavy essay in The Atlantic about why dads and daughters crave closeness but struggle to find it. According to the essay, 28% of American women are estranged from their father, and even where relationships are intact, they tend to be thinner (more transactional, less emotionally honest) than daughters want.

At the root of the modern father-daughter divide seems to be a mismatch in expectations. Fathers, generally speaking, have for generations been less involved than mothers in their kids’ (and especially their daughters’) lives. But lots of children today expect more: more emotional support and more egalitarian treatment. Many fathers, though, appear to have struggled to adjust to their daughters’ expectations. The result isn’t a relationship that has suddenly ruptured so much as one that has failed to fully adapt.

And the psychological explanation that cuts deepest:

“What generates closeness is another person’s vulnerability,” Coleman explained, and dads may not be ready for that.

Daughters aren’t asking for grand gestures or dramatic change—they’re asking for their fathers to show up emotionally. Which turns out to be hard for a lot of men who were raised to see that kind of openness as weakness.

21 February 2026

The AI baseline has moved

Geoffrey Huntley wrote about what happens when people finally “get” AI:

If you’re having trouble sleeping because of all the things that you want to create, congratulations. You’ve made it through to the other side of the chasm, and you are developing skills that employers in 2026 are expecting as a bare minimum.

The only question that remains is whether you are going to be a consumer of these tools or someone who understands them deeply and automates your job function? Trust me, you want to be in the latter camp because consumption is now the baseline for employment.

Knowing how to use these tools is no longer a differentiator. The gap is between people who consume AI outputs and people who understand the systems well enough to build on top of them.

For product managers, this means that prompting ChatGPT for a first draft doesn’t count as an AI skill anymore. The question is whether you can wire together agents, automate your own workflows, and spot opportunities others miss because they’re still thinking in manual processes.

2 February 2026

The Jevons Paradox and the Future of Knowledge Work

I keep thinking about this essay by Mike Fisher about what happens when automation makes work easier. His central argument challenges the assumption that’s baked into most AI-and-jobs discourse:

In every domain where automation becomes powerful, the pattern remains consistent. Human expertise becomes more valuable because the total volume of meaningful work increases. Early fears of automation nearly always assume a fixed amount of work being redistributed. But work is not fixed. Work expands when constraints are removed.

He anchors this on the Jevons Paradox, the 19th-century observation that improved steam engine efficiency led to more coal consumption, not less. He then traces the pattern through radiology, where the number of US radiologists grew from 30,723 in 2014 to 36,024 in 2023, despite Hinton’s 2016 prediction that deep learning would make them obsolete within five years.

He concludes:

AI will reshape the profession, but only in the sense that cars reshaped transportation or spreadsheets reshaped finance. Not by eliminating the field, but by expanding its scope. Not by reducing labor, but by elevating it. Not by shrinking opportunity, but by multiplying it. The world does not need fewer people who understand systems. It needs far more of them.

I find this framing useful because it shifts the question from “will AI take my job?” to “how will the work change as the volume increases?” That’s a much more interesting thing to figure out (which is also why I have been so focused on expanding my Product Second Brain).

25 January 2026

Why "Correction of Error" Gets Incidents (and Product Failures) Wrong

I’ve covered “root cause” thinking in incident reviews before, and Lorin Hochstein takes aim at a related issue: AWS’s “Correction of Error” terminology:

I hate the term “Correction of Error” because it implies that incidents occur as a result of errors. As a consequence, it suggests that the function of a post-incident review process is to identify the errors that occurred and to fix them. I think this view of incidents is wrong, and dangerously so: It limits the benefits we can get out of an incident review process.

What makes his critique compelling is the observation that production systems are full of defects that never cause outages:

If your system is currently up (which I bet it is), and if your system currently has multiple undetected defects in it (which I also bet it does), then it cannot be the case that defects are a sufficient condition for incidents to occur. In other words, defects alone can’t explain incidents.

This applies to product work too. When users report problems, our instinct is to find “the bug” and fix it. But often the bug has been there for months—what changed is the context around it. A new user flow, a spike in traffic, a feature interaction we didn’t anticipate. If we stop at “fixed the bug,” we miss the chance to understand why the system let that failure through in the first place.

25 January 2026

Why AI in Interviews Is Bad for Candidates, Not Just Companies

A quick post on LinkedIn about interviewing a candidate who used real-time AI got more engagement than is usual for me. And as often happens when something goes semi-viral, some folks took issue with what I said, so I want to expand on the point I was trying to make (it wasn’t that “AI is cheating”).

Here’s what I wrote:

I had my first experience interviewing a candidate who used real-time AI today. If you’re someone who uses AI daily, it’s so easy to spot. The pause before the answer, the constant eyes flicking to the other screen, the perfectly-manicured 3-point answer…

Friends, just don’t do this. It’s too easy to spot, and it will also set you up for failure, because it might get you a job that you’re not a good fit for, which is bad for everyone.

Use AI in your job, for sure. But don’t use it to get the job. The interview process is about you. Be you.

One response called this “absolutely myopic” (I had to double-check I didn’t accidentally post on Hacker News) and asked why candidates shouldn’t use AI if it allows for “a better, more creative answer.” Another suggested that if candidates will use AI on the job anyway, then the “real you” isn’t going to be working, so what’s the difference?

Let’s dig into this.

What interviews are actually for

I don’t interview people to test whether they can produce a good answer to a question. I interview people to understand how they think, what they’ve actually done, and whether we’ll work well together.

When I ask “Tell me about a time you had to make a difficult prioritization decision,” I’m not looking for the theoretically optimal framework. I want to hear your story. The messy details and the trade-offs you wrestled with. The thing you got wrong and what you learned from it. AI can’t give me that. It can only give me a polished summary of what prioritization frameworks exist.

One commenter put it well: “It’s about both the company and the individual, so you will often talk about their real experience, what they did, how they felt, what did they learn, digging deeper into their real experience to find out the interesting things that could make them a good match.”

AI might help you phrase things more clearly. But if it’s generating your answers, you’re hiding the very thing I’m trying to evaluate.

The fit problem

Here’s the part that didn’t seem to land: using AI to get a job you’re not qualified for is bad for you.

Let’s say the AI-assisted interview works. You get hired. Now what? You show up on day one, and the expectations are set based on how you performed in those interviews. But that wasn’t you. That was a performance enhanced by a tool you won’t have in the same way during actual work conversations, whiteboard sessions, and quick chat exchanges where people expect you to just… know things.

I’ve seen what happens when there’s a mismatch between interview performance and actual capability. It’s not a fun experience for anyone, least of all the person who’s now struggling in a role they weren’t ready for. One person called it “artificial buzzword ventriloquism” in the comments. Harsh, but not wrong.

It’s about context, not absolutes

A few commenters suggested that interviews should evolve to assume AI assistance, since that’s how people will actually work. One person wrote: “By prohibiting AI during interviews, the interview environment diverges from actual job conditions and fails to evaluate a critical skill: the ability to effectively use one of the most powerful productivity tools available today.”

I think there’s something to this. In fact, our interview process includes a take-home assessment where we explicitly encourage candidates to use AI. We want to see how they approach a problem, how they structure their thinking, and yes, how they use modern tools to get to a good answer. That’s a legitimate skill worth evaluating.

But that’s different from what happened in my interview, where someone was clearly trying to hide their AI usage while answering questions about their past experience. That’s not “using AI as a tool.” That’s using AI as a mask.

I think candidates should absolutely use AI to prepare for interviews: research the company, practice answering common questions, refine their resume.

But in the interview itself, when I’m asking about your experience and your thinking, I need to hear from you. Not because AI is cheating, but because the whole point is to figure out if you are the right fit for this role and this team. If I can’t evaluate that, we can’t make a good hiring decision. And that’s bad for both of us.

23 January 2026

The invisible 40% of engineering work

Anton Zaides wrote a good post about shadow work in engineering teams. He discovered a senior engineer on his team was spending over 40% of his time on work that didn’t show up anywhere—code reviews, mentoring, ad-hoc support fixes, etc.

This part is important:

The shadow backlog isn’t the problem—in my opinion, that’s probably the work that should have been done in the first place. The solution is to stop doing it under the table and make sure you have space for it. The more people don’t agree with your roadmap because it was decided for them, the more shadow backlog you’ll have.

The shadow backlog is a symptom of a roadmap that doesn’t reflect reality, which often happens when engineering teams are not involved in planning and prioritization. The real fix is to involve them: make sure everyone understands and is aligned on the roadmap, and make this kind of BAU (Business As Usual) work visible and planned for.

16 January 2026

The B2B Product Leadership Delusion

Jason Knight wrote about a fascinating disconnect between how B2B product leaders rate themselves and how their teams see them. The data from his survey is striking:

Across the board, B2B Product Leaders think they’re doing pretty well in all of these areas, but B2B IC PMs are not convinced. The difference is stark, and they can’t both be right.

The survey measured six core responsibilities—setting strategy, aligning teams, enabling prioritization, fostering ownership, removing blockers, and investing in people. In every category, leaders rated themselves significantly higher than their ICs rated them. Jason offers three possible explanations: leaders are doing poorly and don’t know it, leaders are doing well but not communicating it, or ICs have unreasonable expectations. He concludes:

Product Leaders need to do a much better job of setting expectations within their teams and communicating with them openly and well. IC Product Managers need to do a much better job of understanding the constraints of their business context and, indeed, the business they work for.

I keep coming back to the iceberg effect he mentions—where only some of the work someone does is visible. This cuts both ways. Leaders underestimate how opaque their work is to their teams, and ICs underestimate the constraints leaders are working within. The gap isn’t just about performance; it’s about mutual understanding.