
Measuring AI's Impact on Shipping Speed and Code Quality

Will Larson has a good post about how they’re adopting AI at his company. The process is interesting, but this is the part that jumped out at me:

My biggest fear for AI adoption is that they can focus on creating the impression of adopting AI, rather than focusing on creating additional productivity. Optics are a core part of any work, but almost all interesting work occurs where optics and reality intersect.

It’s really hard to figure out if AI tools are (1) helping teams ship faster (2) without sacrificing quality.

We’re working on figuring out this problem right now at Cloudflare. Our proposed approach sidesteps the problem of per-commit AI attribution (did Copilot write this line? did Claude?) by correlating team-level AI tool usage with team-level health metrics over time. If a team’s AI adoption increases by 30% and their change failure rate stays stable, that’s a useful signal. If AI usage spikes and incidents start trending up, that’s worth investigating.
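To make the idea concrete, here is a minimal sketch of a team-level correlation check. It is not our actual pipeline: the team names, the weekly numbers, the `flag_teams` helper, and the 0.5 threshold are all made up for illustration, and a Pearson coefficient is just one simple way to look for "usage and failures rising together".

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no directional signal
    return cov / sqrt(var_x * var_y)


def flag_teams(metrics, threshold=0.5):
    """Return (team, r) pairs where AI usage and failure rate rise together."""
    flagged = []
    for team, series in metrics.items():
        r = pearson(series["ai_usage_pct"], series["change_failure_pct"])
        if r >= threshold:
            flagged.append((team, r))
    return flagged


# Hypothetical weekly data: share of engineers using AI tools (%)
# and change failure rate (%), both aggregated per team.
metrics = {
    "team-a": {"ai_usage_pct": [10, 20, 30, 40], "change_failure_pct": [5, 5, 5, 5]},
    "team-b": {"ai_usage_pct": [10, 20, 30, 40], "change_failure_pct": [4, 6, 9, 12]},
}

for team, r in flag_teams(metrics):
    print(f"{team}: usage and failure rate move together (r={r:.2f}), worth a look")
```

In this toy data, team-a's failure rate stays flat while usage grows, so it is not flagged; team-b's failure rate climbs with usage, so it is. A flag is a prompt to investigate, not a verdict.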

The key insight is that you don’t need perfect attribution to get directionally useful data. Correlation isn’t causation, and teams adopting AI tools may already be more experimental or higher-performing. But at least you’re measuring something real instead of something like “# of lines written by AI”, which leads straight to the Goodhart’s Law problem where metrics become targets.

How I Use AI for Product Work

Update! I wrote a follow-up post here: How My AI Product “Second Brain” Evolved.

I’ve been refining my approach to using LLMs for product work, and I figured it’s time to write up how I actually use them day-to-day.

I think the most valuable thing an AI assistant can do isn’t to write your PRD or draft your strategy docs. It’s to push back on weak reasoning, spot gaps you missed, and force you to articulate why your idea is actually good. It’s less a ghostwriter and more a skeptical colleague who shares your product philosophy. With the right prompts, AI assistants are also really good at creating background and framing documents, such as explainers that synthesize complex topics and summaries of technical concepts.

So let’s walk through how I’ve set this up, what makes it work, and how you might build something similar for yourself.

The Philosophy: Sparring Partner, Not Ghostwriter

I believe LLMs are most useful when you give them two things: context and constraints.

  • Context tells the model who you are, what you’re working on, and what “good” looks like in your world.
  • Constraints keep the model from going off the rails with generic advice or hallucinated frameworks.

Every prompt I use is designed to provide both. They’re opinionated on purpose. I’d rather have an assistant that pushes back on bad ideas than one that says “Great idea!” to everything.

The goal isn’t to have AI write presentations for me. It’s to have a thinking partner that:

  • Challenges weak problem statements before I waste time on solutions
  • Spots missing success criteria I forgot to define
  • Asks “why?” when my reasoning gets hand-wavy
  • Points out when I’m jumping to solutions before understanding the problem
  • Helps me create background docs and explainers that set context for others

The Building Blocks

The system works because of how all the pieces fit together. Here’s an overview of the general folder structure I maintain with a series of Markdown files inside:

llm-prompts/
├── prompts/           # System prompts for different use cases
│   ├── pm/            # Product management prompts
│   └── technical/     # Technical/engineering prompts
├── context/           # Personal context files (who I am, how I work)
├── reference/         # Syntax guides and reference docs
└── work/              # Saved feedback and refined docs

The magic isn’t in any single prompt—it’s in how you combine them. Let me break down each layer.

Layer 1: System Prompts

These are the instructions that tell the AI how to behave for a specific task. I have different prompts for different jobs:

  • General PM sparring: A prompt that knows my product philosophy and pushes back on weak reasoning. I use this for thinking through tradeoffs, preparing for meetings, and sanity-checking my approach.
  • Document review: Prompts specifically designed to critique PRDs, OKRs, strategy docs, and other artifacts. These encode what “good” looks like and call out common anti-patterns.
  • Idea stress-testing: A prompt that I stole from my friend Stephen, which simulates a debate between an optimist and a skeptic to pressure-test new ideas before I get too attached to them.
  • Technical understanding: Prompts that help me understand systems, architectural decisions, and technical concepts well enough to lead effectively (I’m not an engineer, but I need to hold my own in architecture reviews).

The key is that each prompt is opinionated. They’re not generic “be helpful” instructions—they encode specific philosophies about what good work looks like.

Layer 2: Personal Context

This is where it gets powerful. I maintain files that describe:

  • Who I am: My role, my experience, my communication style
  • How I work: My product philosophy, my management approach, my values
  • What I’m working on: Current projects, team context, company priorities

When I start a conversation, I can pull in the relevant context files alongside my prompt. The model then has the background it needs to give me advice that actually fits my situation—not generic best practices from a blog post.

Layer 3: Reference Materials

Sometimes you need the model to follow specific formats or conventions. I keep reference files for things like wiki markup syntax, documentation templates, or internal style guides. These ensure the output is actually usable without a bunch of reformatting.

How I Actually Use This

I use Windsurf as my daily driver, and it has a feature that makes this whole system work: the @ mention. In the chat panel, I can reference any file by typing @ followed by the path. Windsurf then includes that file’s contents as context for the conversation.

This means I can compose my “assistant” on the fly by combining:

  1. A system prompt for the task at hand
  2. Relevant personal context files
  3. The document or code I’m working on
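If your tool doesn't have an @ mention equivalent, the same composition can be approximated by hand. The sketch below is only an illustration of the three-part structure above, not how Windsurf works internally; the function names, section headings, and example file paths are all hypothetical.

```python
from pathlib import Path


def read_markdown(path: str) -> str:
    """Load one Markdown file, e.g. a prompt or a context file."""
    return Path(path).read_text(encoding="utf-8").strip()


def compose_prompt(system_prompt: str, context_docs: dict[str, str], document: str) -> str:
    """Join a task prompt, named context files, and the working document."""
    parts = ["## Instructions", system_prompt.strip()]
    for name, text in context_docs.items():
        parts += [f"## Context: {name}", text.strip()]
    parts += ["## Document under review", document.strip()]
    return "\n\n".join(parts)


# Inline strings stand in for files here. With real files you might call
# read_markdown("prompts/pm/prd-review.md") instead (a hypothetical path).
prompt = compose_prompt(
    system_prompt="Critique this PRD. Push back on vague problem statements.",
    context_docs={"product-philosophy": "Start from the user problem, not the solution."},
    document="Problem: users want a better dashboard.",
)
print(prompt)
```

Keeping the pieces separate until the last moment is the point: the same context files can be paired with a different system prompt for a different task.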

Example: Document Review

When I need feedback on a PRD before sharing it with stakeholders, I’ll start a conversation and reference my PRD review prompt plus my product philosophy context. Then I paste in the PRD and ask for critique.

The model comes back with feedback grounded in my own standards—not generic advice. It’ll call out if my problem statement is vague, if my success metrics aren’t measurable, or if I’m jumping to solutions before properly framing the problem. Exactly the kind of pushback I’d want from a senior colleague.

Example: Brainstorming Partner

For early-stage thinking, I use a more conversational prompt that knows how I like to explore ideas. I’ll describe what I’m thinking about and ask it to poke holes, suggest angles I haven’t considered, or help me articulate why something feels off.

This is particularly useful before big meetings. I can rehearse my reasoning and get challenged on the weak spots before I’m in front of stakeholders.

Example: Technical Understanding

I’m not an engineer, but I work with technical teams. When I need to understand how a system works—well enough to ask good questions or spot when something doesn’t add up—I use prompts designed for technical explanation.

The key is that these prompts know to explain things without condescension but also without assuming I know the jargon. They cite specific files and line numbers when relevant, and they explain the “why” behind design decisions.

Connecting to Real Data

One feature that’s made a big difference is MCP (Model Context Protocol) servers. These connect the AI to external data sources—internal wikis, documentation sites, code repositories, APIs—so it can ground its responses in actual information rather than just its training data.

In my prompts, I tell the model which MCP servers are available and when to use them. For example, my technical prompts instruct the model to:

  • Search official documentation first to ground answers in verified information
  • Check internal wikis for known issues, edge cases, and workarounds
  • Look at code repositories when documentation is incomplete
  • Always cite sources with links so I can verify

This turns the AI from a general-purpose assistant into something more like an expert who has access to your company’s actual knowledge base. The difference in answer quality is significant—instead of generic advice, I get responses that reference real docs and real code.

Keeping a Record

One practical tip: save the conversation output somewhere useful.

I have a work/ folder organized by topic where I save feedback and refined thinking. When the model gives me good critique on a PRD, I’ll ask it to write a summary of the key issues to a Markdown file I can reference later. This keeps the insights from getting lost in chat history.

What I’ve Learned

A few things that have made this work better over time:

  1. Context files are worth the investment. I have files that describe who I am, how I work, and what I value. Updating these takes time, but it pays off in every conversation.
  2. Pushback is a feature, not a bug. These prompts are designed to challenge bad thinking. If the model is pushing back on your approach, consider that it might be right.
  3. Iterate on the prompts. I update these regularly based on what works and what doesn’t. If a prompt isn’t helping, change it.
  4. Less context is often more. Including too much context can dilute the signal. Start with the minimum you need, and add more only if the model seems confused.

This setup isn’t a silver bullet that makes thinking go away—it’s just a way to encode my preferences and philosophies into something an LLM can use as a baseline for pushing back on my thinking. I still write my own PRDs, OKRs, and strategy docs—the artifacts that represent my actual thinking. But I let AI help me create background documents, explainers, and context-setting materials. And I have a sparring partner that catches the gaps I miss, challenges the assumptions I glossed over, and asks the uncomfortable questions before stakeholders do.

If you build something similar, I’d love to hear how it goes. The prompts are important, but what matters just as much is the structure: context plus constraints, composed on the fly for the task at hand. That’s the thing that makes it work.

Update! I wrote a follow-up post here: How My AI Product “Second Brain” Evolved.

"Disagree and Let’s See"

I like this alternative to the “Disagree and Commit” saying:

“Disagree and let’s see” allows you to stay aligned with the team without forcing you to pretend you had conviction you didn’t have. It lets you walk into a room with your team and be honest:

“Here’s the path that was chosen. It wasn’t my first pick, but here’s the experiment we’re running, and here’s what we’re trying to learn.”

That’s a much more authentic stance for most leaders than repeating something with a tight smile and hoping no one notices your doubt.

Source: “Disagree and Let’s See”

New side project: Discord Stock & Crypto Bot

Not sure how many people would be interested in this, but it was fun to make so I thought I’d share. This is a Discord bot that provides real-time stock and cryptocurrency information, 30-day price trends, and AI-powered news summaries through slash commands. Once you add the bot to Discord, you can use the /stock and /crypto commands to pull up this information.

Want to add it to your Discord server? Head over here!

Horrible edge cases to consider when dealing with music

Metadata is the hardest problem in software, and these examples prove my point. Don’t @ me!

My favourite: a band named brouillard, with a single member called brouillard, whose every single album is named brouillard, and of course, so is every single track.

Source: Horrible edge cases to consider when dealing with music

Brief thoughts on the recent Cloudflare outage

Lorin Hochstein is a big name in the LFI (Learning From Incidents) space. He often writes about post-incident reviews, and he has a very interesting write-up of the Cloudflare outage on November 18, 2025. I especially loved this part:

Companies generally err on the side of saying less rather than more. After all, if you provide more detail, you open yourself up to criticism that the failure was due to poor engineering. The fewer details you provide, the fewer things people can call you out on. It’s not hard to find people online criticizing Cloudflare online using the details they provided as the basis for their criticism.

I think it would advance our industry if people held the opposite view: the more details that are provided an incident writeup, the higher esteem we should hold that organization. I respect Cloudflare is an engineering organization a lot more precisely because they are willing to provide these sorts of details. I don’t want to hear what Cloudflare should have done from people who weren’t there, I want to hear us hold other companies up to Cloudflare’s standard for describing the details of a failure mode and the inherently confusing nature of incident response.

Source: Brief thoughts on the recent Cloudflare outage

The price of admission

Some tough love here about what it means to have “executive presence”.

When someone tells you that you need more business sense, or that you’re not ready for more scope, or that you need to level up, this is typically what they’re trying to communicate. That you’re more concerned with how work happens than with what work should happen in the first place.

Source: The price of admission

How I give the right amount of context (in any situation)

A great list of things to keep in mind when communicating via writing. The article is focused on “managing up” but these principles are relevant in a much broader context as well.

What questions does your manager usually ask? Answer those questions yourself. If you take anything away from this article, make it this. Every manager has their own idiosyncrasies, worldview, values, etc. That’s why the best thing to do is to pattern match. Consider what they’ve asked you in the past, when talking to you or others. Try to give context through that lens.

Source: How I give the right amount of context (in any situation)

Selling Lemons

This is an essay I think everyone should read, front to back. It’s about all the things we are living through right now, but it’s especially about work (and AI):

I’m not sure hiring can ever be much more efficient, because neither side has reason to show themselves as they really are, warts and all. Idealistically, both would come straight; pragmatically, it is a game of chicken. Candidates polish résumés and present curated versions of their abilities, listing outcomes and impact statistics with dubious accuracy and provenance. Companies do the same, putting culture and mission front and center while hiding systematic dysfunctions and looming existential risks. When neither side is forthcoming, you’re left with proxies: a famous logo on a resume, a polished culture deck.

Source: Selling Lemons

ChatGPT Is Blowing Up Marriages as It Goads Spouses Into Divorce

Wild story:

Multiple people we spoke to for this story lamented feeling “ganged up on” as a partner used chatbot outputs against them during arguments or moments of marital crisis. One of these sources, a man who’s now in the process of selling his home as he and his spouse barrel toward divorce, recounted feeling voiceless as his partner turned to ChatGPT to pathologize their relationship. “I was really hurt by the way [ChatGPT] was being used against me,” said the man, speaking through tears. “I felt like it was being leveraged… like, ‘I didn’t feel great about whatever happened, and so I went to ChatGPT, and ChatGPT said that you’re not a supportive partner, and this is what a supporting partner would do.’”

I think we have to realize that non-tech people don’t have a good understanding of the sycophantic nature of LLM bots, so we’ll see more and more examples like this.

Source: ChatGPT Is Blowing Up Marriages as It Goads Spouses Into Divorce