
Endgame for the open web

Anil Dash has a long essay on the state of the open web. Not all of it rings true for me, but buried in the opening is a wonderful definition of what the open web actually is:

The open web is something extraordinary: anybody can use whatever tools they have, to create content following publicly documented specifications, published using completely free and open platforms, and then share that work with anyone, anywhere in the world, without asking for permission from anyone. Think about how radical that is.

It does feel like if the web were invented in 2026, it would not stay an open technology for long (see also AI, and how far open source models are lagging behind).

Negative space in writing

Tracy Durnell explores non-visual negative space—what happens when writing leaves room for the reader to think:

The current design trend of business and self-help style books is to use tons of subheadings and callout boxes and always, a list of the key points at the end of the chapter. While this is a highly skimmable format and often nice visual design, it essentially sucks the negative space out of the text — the places in which the reader might step back and consider their own examples or anticipate what point the author is trying to make. There’s no time for hunches here.

And:

The negative space of the text helps build the aesthetic experience. Small details flavor the text with a sense of reality. Drawing out events — leaving questions unresolved and conflicts unsettled — can build tension. And textual space creates a gap for the reader to make the personal decodings of the text that build meaning.

Not everything has to get to the point immediately. Sometimes the best thing a writer can do is leave room for the reader to get there on their own. I’m thinking about this because I’m currently reading The Will of the Many. It is slow, and long, and one of the best books I’ve read in ages. The negative space is probably a big reason why I love it so much.

Agentic manual testing

Simon Willison has a practical guide on manual testing with coding agents. Two tips I’ve already started using:

It’s still quick for an agent to write out a demo file and then compile and run it. I sometimes encourage it to use /tmp purely to avoid those files being accidentally committed to the repository later on.

And:

If an agent finds something that doesn’t work through their manual testing, I like to tell them to fix it with red/green TDD. This ensures the new case ends up covered by the permanent automated tests.
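The first tip can be sketched in a few lines. This is a minimal illustration, assuming a Python project; the `run_demo` helper is hypothetical and not taken from Simon's guide. It writes a throwaway script into a fresh directory under the system temp location (often /tmp on Linux), runs it, and lets the directory disappear afterwards, so nothing can be committed by accident.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_demo(source: str) -> str:
    """Write a throwaway demo script outside the repo, run it, return stdout."""
    # The directory lives under the system temp location and is deleted
    # when the `with` block exits, so the demo never touches the repository.
    with tempfile.TemporaryDirectory(prefix="agent-demo-") as tmp:
        demo = Path(tmp) / "demo.py"
        demo.write_text(source)
        result = subprocess.run(
            [sys.executable, str(demo)],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the demo exits non-zero
        )
    return result.stdout
```

If the demo exposes a bug, the second tip applies: capture the failing case as a permanent test in the repository rather than leaving it in the temp directory.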

From Assistant to Collaborator: How My AI Second Brain Grew Up

Over the past few months I’ve been writing about how I use AI for product work. The first post covered the philosophy: context files, opinionated prompts, and how to compose the right inputs for each task. The second added slash commands and daily summaries. The third was a hands-on setup guide. And the fourth introduced project brains for keeping complex initiatives organized.

This post covers a different kind of change. The earlier additions were incremental: more commands, better context, smoother workflows. What changed recently feels more like a threshold. The system went from a tool I invoke for specific tasks to something closer to a collaborator I dispatch to do real work. Three capabilities drove that shift: multi-agent orchestration, cross-session memory, and the encoding of domain expertise into the system itself.

Multi-Agent Workflows

The clearest example is customer escalation investigations. As a PM for data products, I regularly investigate customer-reported issues: logging gaps, data discrepancies, behavior that doesn’t match expectations. These investigations require pulling information from multiple sources and cross-referencing it all into an analysis that engineering can act on.

I built a slash command that handles this as a multi-phase workflow. When I run it with a ticket ID, here’s what happens:

  1. The system reads the customer ticket, extracts the core problem, identifies which product area is involved, and classifies the issue type.
  2. Three specialist agents launch simultaneously, each focused on a different data source. One searches the codebase for the relevant logic and recent changes. Another searches for related tickets and prior incidents across projects. A third checks documentation and internal wiki pages for relevant operational context.
  3. A fourth agent receives the combined findings and produces database queries that can confirm or refute the working hypothesis.
  4. The system combines everything into a structured analysis: issue classification, root cause anchored in code where possible, customer impact, and recommended next steps.
  5. A blind validator independently re-fetches every source cited in the draft to verify the claims hold up. Then an adversarial challenger looks for alternative explanations and tests whether the classification is correct.
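The phases above can be sketched roughly as follows. This is an illustration of the shape only, not the actual command file: `run_agent` is a hypothetical stand-in for dispatching an agent, and the triage and validation phases are stubbed.

```python
import asyncio
from typing import Optional


async def run_agent(name: str, ticket: dict, context: Optional[list] = None) -> dict:
    """Hypothetical stand-in for dispatching one specialist agent."""
    await asyncio.sleep(0)  # a real agent call would await a model here
    seen = len(context) if context else 0
    return {"agent": name, "findings": f"{name} findings for {ticket['id']}", "inputs": seen}


async def investigate(ticket: dict) -> dict:
    # Phase 1: triage (stubbed; a real version would classify the issue).
    triage = {"id": ticket["id"], "area": ticket.get("area", "unknown")}

    # Phase 2: three specialists run concurrently, one per data source.
    findings = await asyncio.gather(
        run_agent("code", triage),
        run_agent("tickets", triage),
        run_agent("docs", triage),
    )

    # Phase 3: a fourth agent only starts once all findings are in.
    queries = await run_agent("queries", triage, context=list(findings))

    # Phase 4: synthesis into one structured draft.
    draft = {"ticket": triage["id"], "findings": list(findings), "queries": queries}

    # Phase 5: validation (stubbed) runs against the finished draft.
    draft["validated"] = all(f["findings"] for f in findings)
    return draft
```

The ordering is the point: `asyncio.gather` fans out the independent searches, and every later phase awaits the results it depends on.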

The output is a document I can review with an engineering colleague or paste into a chat thread. It includes a confidence assessment and a data collection status table showing what was checked and what was unavailable, along with how the analysis compensated for gaps.

The command file that orchestrates all of this isn’t prompting in the traditional sense. It defines which agents to dispatch, what information each one needs, when to wait for results before proceeding, and how to handle failures gracefully. Writing this felt more like designing a workflow than writing a prompt.
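One way to picture "handle failures gracefully" is the sketch below. It assumes the agents are plain async functions, and the two source functions are hypothetical stubs. A failed source is recorded in a status table instead of aborting the whole investigation.

```python
import asyncio


async def code_search(ticket_id: str) -> str:
    return f"recent changes relevant to {ticket_id}"


async def wiki_search(ticket_id: str) -> str:
    # Stub that simulates an unavailable source.
    raise TimeoutError("wiki search timed out")


async def collect(ticket_id: str) -> dict:
    sources = {"code": code_search, "wiki": wiki_search}
    # return_exceptions=True returns exceptions as results instead of
    # raising the first one, so the other sources still get reported.
    results = await asyncio.gather(
        *(search(ticket_id) for search in sources.values()),
        return_exceptions=True,
    )
    status = {}
    for name, result in zip(sources, results):
        if isinstance(result, Exception):
            status[name] = {"checked": False, "note": str(result)}
        else:
            status[name] = {"checked": True, "note": result}
    return status
```

The resulting dictionary is a small version of the data collection status table: what was checked, what was unavailable, and why.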

I’ve applied the same pattern to other tasks. A “fix feasibility” command evaluates whether a ticket describes a code change simple enough for a PM to implement with AI coding assistance, and produces an implementation brief if the answer is yes. The specific use cases differ, but the architecture is the same: break the problem into specialist tasks that run in parallel, then synthesize and validate the results.

Cross-Session Memory

AI conversations are stateless by default. Every new session starts from zero, which means re-explaining context that should already be established. Over a few weeks of working on the same projects, this friction adds up.

I addressed this with a four-layer memory system:

  • The first layer is stable facts: a compact file that captures the current state of all active work, including project status, recent decisions, and environment constraints. This is the primary orientation file. When I start a session, the AI reads it and immediately knows what’s in flight.
  • The second is a session log: a reverse-chronological list of handoff notes. Each entry records what happened in a session and what threads remain open. The last three entries give enough context to pick up where I left off.
  • Third, a corrections file. This holds behavioral fixes for things the AI consistently gets wrong. It’s a staging area that should shrink over time as fixes get promoted elsewhere.
  • And finally, a decisions log: a cross-cutting record of decisions that don’t belong to a specific project. Each entry captures context and rationale so I don’t relitigate settled questions.

Two commands manage this. /session-start loads all four files and presents a brief summary of current state and recent sessions. /session-end reviews the conversation, writes a handoff note, and then checks whether any learnings should be promoted to infrastructure.
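As a rough sketch of what the two commands do with the files, assuming markdown files on disk: the file names and the "## " entry marker below are assumptions, not the names used in the actual setup, and the real commands are prompts rather than Python.

```python
import re
from datetime import date
from pathlib import Path

# Hypothetical file names for the four layers.
MEMORY_FILES = ["stable-facts.md", "session-log.md", "corrections.md", "decisions.md"]


def session_start(memory_dir: Path, recent: int = 3) -> dict:
    """Load all four layers and keep only the newest handoff notes."""
    layers = {}
    for name in MEMORY_FILES:
        path = memory_dir / name
        layers[name] = path.read_text() if path.exists() else ""
    # Assumed format: each handoff note starts with a "## " heading line.
    entries = re.split(r"(?m)^## ", layers["session-log.md"])
    notes = [entry.strip() for entry in entries if entry.strip()]
    layers["recent_sessions"] = notes[:recent]  # the log is newest-first
    return layers


def session_end(memory_dir: Path, summary: str, open_threads: list[str]) -> None:
    """Prepend a handoff note so the log stays reverse-chronological."""
    log = memory_dir / "session-log.md"
    threads = "\n".join(f"- {thread}" for thread in open_threads)
    note = f"## {date.today().isoformat()}\n{summary}\nOpen threads:\n{threads}\n\n"
    existing = log.read_text() if log.exists() else ""
    log.write_text(note + existing)
```

Prepending rather than appending is what makes "the last three entries" cheap to read: they are always at the top of the file.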

“Promote to infrastructure” means taking something learned during a session and baking it into the files the agent actually reads. A correction about how to handle a specific edge case in escalation investigations might start in the corrections file, then get promoted into the escalation command or a domain skill once it’s validated. The corrections file shrinks over time as that knowledge moves into the right places.

This creates a loop where the system improves its own instructions. I approve every change, so it’s not self-modifying in a creepy way. But in practice each work session can make the next one slightly better, and the compound effect over weeks is noticeable.
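The approval step is the important part of that loop. Here is a minimal sketch, assuming one correction per line and a hypothetical `approve` callback standing in for the human review; the actual promotion is done by the agent editing files, not by a script like this.

```python
from pathlib import Path
from typing import Callable


def promote(
    correction: str,
    corrections_file: Path,
    target_file: Path,
    approve: Callable[[str, Path], bool],
) -> bool:
    """Move one correction into a file the agent reads, if a human approves."""
    lines = corrections_file.read_text().splitlines()
    if correction not in lines:
        return False  # nothing to promote
    if not approve(correction, target_file):
        return False  # rejected: both files stay untouched
    with target_file.open("a") as target:
        target.write(correction + "\n")
    # Removing the promoted line is what makes the staging file shrink.
    remaining = [line for line in lines if line != correction]
    corrections_file.write_text("\n".join(remaining) + ("\n" if remaining else ""))
    return True
```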

Domain Expertise

The earlier posts described skills like pm-thinking, which applies product methodology (problem-first thinking, measurable outcomes) to any PM-related conversation. That’s useful, but generic. It works the same way regardless of what product you’re building.

The bigger shift was building skills that encode institutional knowledge about specific products. I now have skills for each major product area my team owns: log delivery, analytics, audit logs, alerting, and data pipelines. Each skill contains the product’s architecture and common failure modes, along with which code repositories to search and which database tables hold relevant data.

This is what makes the multi-agent workflows useful. When the code investigator agent examines an escalation about missing logs, the domain skill tells it which service handles job state and which repository contains the delivery pipeline. It also flags recent architectural changes that might be relevant. Without that context, the agent produces plausible-sounding analysis that misses the specific details engineering needs.
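To make the shape of a domain skill concrete, here is a sketch of the information it carries, modeled as a data structure. Real skills are prose files, and every value below (repository names, table names, failure modes) is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class DomainSkill:
    product_area: str
    architecture: str
    failure_modes: list[str] = field(default_factory=list)
    repositories: list[str] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)

    def briefing(self) -> str:
        """Render the context an investigating agent would receive."""
        return "\n".join([
            f"Product area: {self.product_area}",
            f"Architecture: {self.architecture}",
            "Common failure modes: " + "; ".join(self.failure_modes),
            "Repositories to search: " + ", ".join(self.repositories),
            "Tables with relevant data: " + ", ".join(self.tables),
        ])


# Placeholder content for illustration only.
SKILLS = {
    "log delivery": DomainSkill(
        product_area="log delivery",
        architecture="a job service tracks state; a pipeline ships batches",
        failure_modes=["job stuck in pending", "destination rejects batch"],
        repositories=["delivery-pipeline", "job-state-service"],
        tables=["delivery_jobs"],
    ),
}


def briefing_for(product_area: str) -> str:
    """Return the briefing for an area, or an empty string if none exists."""
    skill = SKILLS.get(product_area)
    return skill.briefing() if skill else ""
```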

Now every investigation that uses a skill validates or extends the knowledge it contains, and /session-end catches insights that should be added back.

How The Work Changes

The biggest change is in my own role. It’s gone from “write the right prompt” to “design the right process.” The escalation command is a workflow with phases, dependencies, and validation steps, and thinking about it that way beats trying to pack everything into a single conversation. A few other things I’ve noticed:

  • Validation has to be built in. The blind validator exists because agents make mistakes. They cite files that don’t exist, mischaracterize what code does, or draw conclusions the evidence doesn’t support. Catching those issues before they reach anyone else is the whole point.
  • Cross-session memory requires discipline. The system only works if I run /session-end after substantive sessions and keep stable facts current. When I skip it, the next session starts cold and I lose the compounding benefit. Automation helps, but the commitment to maintain the memory is mine.
  • And domain skills need regular maintenance. Products change. Code gets refactored, pipelines get rearchitected. Skills that aren’t periodically updated drift from reality. I haven’t solved this well yet. It’s still a manual process of noticing when a skill’s knowledge is stale and updating it.
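The simplest of the validator's checks, catching cited files that don't exist, can be sketched like this. It assumes citations are repo-relative paths and covers only that one failure; checking what the code does or whether conclusions follow from the evidence needs an agent, not a path lookup.

```python
from pathlib import Path


def unverified_citations(cited_paths: list[str], repo_root: Path) -> list[str]:
    """Return cited paths that do not resolve to a file inside repo_root."""
    root = repo_root.resolve()
    missing = []
    for cited in cited_paths:
        target = (root / cited).resolve()
        # Reject paths that escape the repository via "..".
        inside = root in target.parents
        if not (inside and target.is_file()):
            missing.append(cited)
    return missing
```

An empty result means every cited file exists, which is necessary but nowhere near sufficient for the analysis to be right.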

The system still makes mistakes. Multi-agent workflows are more thorough than single-prompt conversations, but they’re not infallible. The confidence assessment in the escalation output exists because sometimes the answer is “medium confidence, we couldn’t confirm this from the available data.” That honesty about limitations is more useful than false certainty.

Where This Is Going

I’m sure the specific commands and skills will look different in six months as I learn what works and what doesn’t. But the underlying pattern feels durable: compose specialist agents with deep domain context, validate their output, and feed learnings back into the system.

I’ve published updated files to the Product AI Public repo, including the session memory commands and a generalized version of the multi-agent escalation workflow. If you’re building something similar, those might be useful starting points.

None of these pieces does much on its own. It’s the way they feed each other that turned a pile of separate prompts into something I lean on every day.

When Using AI Leads to “Brain Fry”

I am definitely feeling the “brain fry” right now:

We found that the phenomenon described in these posts—cognitive exhaustion from intensive oversight of AI agents—is both real and significant. We call it “AI brain fry,” which we define as mental fatigue from excessive use or oversight of AI tools beyond one’s cognitive capacity. Participants described a “buzzing” feeling or a mental fog with difficulty focusing, slower decision-making, and headaches.

The research is fascinating and worth reading, with super interesting findings like this:

As employees go from using one AI tool to two simultaneously, they experience a significant increase in productivity. As they incorporate a third tool, productivity again increases, but at a lower rate. After three tools, though, productivity scores dipped. Multitasking is notoriously unproductive, and yet we fall for its allure time and again.

Earlier this week I had this thought: “Oh no, I think I’ve blown out my context window. I wish I could add some more tokens to my brain. Until then I might just have to respond to new requests with 401 Unauthorized.”

And that’s when I realized I probably need to go touch grass or something.

AI should help us produce better code

As usual, Simon Willison hits the nail on the head here:

If adopting coding agents demonstrably reduces the quality of the code and features you are producing, you should address that problem directly: figure out which aspects of your process are hurting the quality of your output and fix them. Shipping worse code with agents is a choice. We can choose to ship code that is better instead.

Also see Mitchell Hashimoto’s idea of “harness engineering”:

It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.

On Meeting Your Child Again, and Again

Derek Thompson wrote a wonderful essay on what happens when you become a parent:

The baby you bring home from the hospital is not the baby you rock to sleep at two weeks, and the baby at three months is a complete stranger to both. In a phenomenological sense, parenting a newborn is not at all like parenting “a” singular newborn, but rather like parenting hundreds of babies, each one replacing the previous week’s child, yet retaining her basic facial structure. “Parenthood abruptly catapults us into a permanent relationship with a stranger,” Andrew Solomon wrote in Far From the Tree. Almost. Parenthood catapults us into a permanent relationship with strangers, plural to the extreme.

Why It’s Still Valuable To Learn To Code

Carson Gross has a good essay on whether junior programmers should still learn to code given how capable AI has become. His core warning to students:

Yes, AI can generate the code for this assignment. Don’t let it. You have to write the code. I explain that, if they don’t write the code, they will not be able to effectively read the code. The ability to read code is certainly going to be valuable, maybe more valuable, in an AI-based coding future. If you can’t read the code you are going to fall into The Sorcerer’s Apprentice Trap, creating systems you don’t understand and can’t control.

And on what separates senior engineers who can use AI well from those who can’t:

Senior programmers who already have a lot of experience from the pre-AI era are in a good spot to use LLMs effectively: they know what ‘good’ code looks like, they have experience with building larger systems and know what matters and what doesn’t. The danger with senior programmers is that they stop programming entirely and start suffering from brain rot.

This maps directly onto what I’ve been writing about with AI for product work and the second brain setup I’ve built. The system works because I spent years writing and reading PRDs, strategy docs, and OKRs—enough to develop actual opinions about what good looks like. You have to do the work first, then the second brain is worth building.

An AI Wake-Up Call

Matt Shumer’s Something Big Is Happening has made the rounds over the last couple of weeks, but just in case you haven’t seen it, I think it’s very much worth reading. He’s an AI startup founder writing for the non-technical people in his life:

AI isn’t replacing one specific skill. It’s a general substitute for cognitive work. It gets better at everything simultaneously. When factories automated, a displaced worker could retrain as an office worker. When the internet disrupted retail, workers moved into logistics or services. But AI doesn’t leave a convenient gap to move into. Whatever you retrain for, it’s improving at that too.

Previous waves of automation always left somewhere to go. The uncomfortable implication here is that the escape routes are closing as fast as they open.

There are too many quotes worth commenting on, but this observation about what we tell our kids feels important:

The people most likely to thrive are the ones who are deeply curious, adaptable, and effective at using AI to do things they actually care about. Teach your kids to be builders and learners, not to optimize for a career path that might not exist by the time they graduate.

Predictions about the pace of change tend to be simultaneously too aggressive and too conservative in ways that are hard to anticipate. But the direction feels right, and the practical advice is sound: use the tools seriously, don’t assume they can’t do something just because it seems too hard, and spend your energy adapting rather than debating whether this is real.

Toolshed, blueprints, and why good agents need good DevEx

Alistair Gray published part two of Stripe’s “Minions” series, going deeper on how they built their internal coding agents. It’s a great read throughout, but three ideas really stood out to me.

First, blueprints. These are workflows that mix deterministic steps with agentic ones:

Blueprints are workflows defined in code that direct a minion run. Blueprints combine the determinism of workflows with agents’ flexibility in dealing with the unknown: a given node can run either deterministic code or an agent loop focused on a task. In essence, a blueprint is like a collection of agent skills interwoven with deterministic code so that particular subtasks can be handled most appropriately.

If you know a step should always happen the same way, don’t let an LLM decide how to do it. Let the agent handle the ambiguous parts, and hardcode the rest (this can also dramatically reduce token cost).
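A toy version of the idea, to show the shape: this is not Stripe's implementation, and the node names and the stubbed agent step are assumptions. Each node is either plain code or an agent loop, and the runner treats them the same way.

```python
from dataclasses import dataclass
from typing import Callable

State = dict


@dataclass
class Node:
    name: str
    run: Callable[[State], State]
    agentic: bool = False  # True when the step delegates to an agent loop


def run_blueprint(nodes: list[Node], state: State) -> State:
    """Run nodes in order and record which kind of step handled each one."""
    for node in nodes:
        state = node.run(state)
        kind = "agent" if node.agentic else "code"
        state.setdefault("trace", []).append((node.name, kind))
    return state


def create_branch(state: State) -> State:
    # Deterministic: always done the same way, so no model is involved.
    return {**state, "branch": f"fix/{state['ticket']}"}


def write_patch(state: State) -> State:
    # Stub for an agent loop; a real node would call a model here.
    return {**state, "patch": "stub patch"}


def run_linter(state: State) -> State:
    # Deterministic again: linting never needs an LLM to decide how.
    return {**state, "lint_ok": True}


BLUEPRINT = [
    Node("create_branch", create_branch),
    Node("write_patch", write_patch, agentic=True),
    Node("run_linter", run_linter),
]
```

Only the middle node would spend tokens; the steps around it are ordinary function calls.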

Second, their centralized MCP server:

We built a centralized internal MCP server called Toolshed, which makes it easy for Stripe engineers to author new tools and make them automatically discoverable to our agentic systems. All our agentic systems are able to use Toolshed as a shared capability layer; adding a tool to Toolshed immediately grants capabilities to our whole fleet of hundreds of different agents.

A shared tool layer that all agents can use… 500 tools, one server, hundreds of agents. Very cool idea.
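The core of a shared capability layer is a registry that tools are added to once and that every agent can query. The sketch below is a plain in-process illustration of that idea. It does not use the MCP protocol or reflect how Toolshed is built, and `ToolRegistry` and `ticket_lookup` are hypothetical names.

```python
from typing import Any, Callable


class ToolRegistry:
    """One shared place where tools are registered and discovered."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable[..., Any]]] = {}

    def tool(self, name: str, description: str):
        """Decorator that registers a function under a tool name."""
        def register(fn: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = (description, fn)
            return fn
        return register

    def list_tools(self) -> dict[str, str]:
        """What an agent would see when it asks which tools exist."""
        return {name: desc for name, (desc, _) in self._tools.items()}

    def call(self, name: str, **kwargs: Any) -> Any:
        _, fn = self._tools[name]
        return fn(**kwargs)


registry = ToolRegistry()


@registry.tool("ticket_lookup", "Fetch a ticket summary by id (stubbed)")
def ticket_lookup(ticket_id: str) -> dict:
    return {"id": ticket_id, "summary": "placeholder summary"}
```

Because every agent reads from the same registry, adding one decorated function makes the tool available to all of them at once.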

And third, what they call “shifting feedback left”:

We have pre-push hooks to fix the most common lint issues. A background daemon precomputes lint rule heuristics that apply to a change and caches the results of running those lints, so developers can usually get lint fixes in well under a second on a push.

If you can catch a problem before it hits CI, do it there. A sub-second lint fix on push is better than a 10-minute CI failure, whether you’re a person or an LLM burning tokens.
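The caching half of that idea can be sketched with a content hash as the cache key. This is an assumption-laden toy: Stripe's daemon precomputes results in the background, while this version only shows why a repeated check on unchanged content is nearly free. `LintCache` and the one-rule linter are hypothetical.

```python
import hashlib
from typing import Callable


class LintCache:
    """Cache lint results by content hash so unchanged files are never re-linted."""

    def __init__(self, lint: Callable[[str], list[str]]) -> None:
        self._lint = lint
        self._results: dict[str, list[str]] = {}
        self.misses = 0  # how many times the real linter had to run

    def check(self, source: str) -> list[str]:
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._lint(source)
        return self._results[key]


def trailing_whitespace(source: str) -> list[str]:
    """A deliberately tiny linter with a single rule."""
    return [
        f"line {number}: trailing whitespace"
        for number, line in enumerate(source.splitlines(), start=1)
        if line != line.rstrip()
    ]
```

The second check of the same content is a dictionary lookup, which is the property that makes sub-second feedback on push plausible.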

So much of Stripe’s agent success is built on top of investments they made for human developer productivity. Good dev environments, fast feedback loops, shared tooling. The agents benefit from all of it, and developers remain in control.