What interviews measure when AI is in the room
The evaluation dynamics that change when you tell a candidate to use every tool they have
Swizec Teller published a piece this week about redesigning software engineer interviews for the age of AI. The core of it: give candidates a real business problem, a stubbed-out codebase, API documentation, and credentials. Let them use AI, Google, the interviewer, anything. Then watch how they navigate. His argument is that leetcode measures typing and memorization. His interview measures whether you can ship, and shipping now involves AI.
He’s right about the mechanics. The interview he describes is closer to the actual job than anything involving a whiteboard and a binary tree. But there are dynamics underneath the design that are worth examining, because they determine who the interview rewards and who it quietly filters out, and those dynamics have changed in ways most hiring teams haven’t processed.
Fractal knowledge is the right signal. Most interviews can’t detect it.
The best concept in Swizec’s piece is what he calls “fractal knowledge”: the idea that real experience reveals itself in layers. The more you’ve worked on something, the more detail you can surface. Ask about the foobinator, and a person who built it can tell you how it talked to the bazingulation, what happened when it failed, how they found out, what they learned. Every answer opens five more layers. Someone who managed or read about the system can talk around it but never quite into it.
This is the right signal for competence, and it maps to something specific in the research. Psychologists call it the illusion of explanatory depth: we think we understand things because they’re familiar, but when we try to explain them in detail, the edges of our understanding become visible. Swizec is using this as a detection tool, and it’s a good one.
The problem is that fractal knowledge detection requires an interviewer who can follow the candidate into the layers. If the interviewer doesn’t have depth in the relevant domain, they can’t tell the difference between genuine fractal detail and fluent surface narration. And AI has made fluent surface narration much easier to produce.
A candidate who spent last weekend reading Claude’s explanation of distributed systems can now talk about consensus algorithms, partition tolerance, and leader election with surprising fluency. They can answer the first two levels of follow-up questions because the AI explanation was thorough. They hit the wall at level three or four, where real hands-on experience would generate the war stories, the edge cases, the “we tried that and here’s what broke.” But an interviewer who isn’t calibrated to probe that deeply won’t reach the wall. They’ll stop at level two and mark the candidate as strong.
I wrote about this dynamic in my confidence piece: in any evaluation where competence isn’t externally measured, the gap between fluency and depth becomes invisible. AI has widened that gap by making fluency cheaper. The interviewer’s job is to probe past fluency into depth, and that job just got harder.
AI in the interview changes who gets rewarded
Swizec’s interview design says: use whatever tools you want, including AI. This is realistic and fair. The actual job involves AI. Testing without it would be like testing a carpenter without letting them use power tools.
But “use whatever tools you want” is not a neutral instruction. It rewards a specific set of meta-skills that are correlated with, but not identical to, engineering competence.
It rewards people who are already fluent with AI coding tools. A candidate who has spent months building with Claude Code will navigate the stubbed-out codebase faster, generate working implementations sooner, and have more time for the operational discussion that Swizec says is the highest-signal part of the interview. A candidate who is equally competent but less AI-fluent will spend their hour wrestling with the tool instead of demonstrating their judgment.
This is the same capability gap I described in the Shadow Dev piece: AI adoption has split engineers into two groups with different effective output. In production, that gap creates team fractures. In interviews, it creates a filter that selects for AI fluency and may accidentally filter out engineers whose depth exceeds their tooling adoption.
Swizec acknowledges this friction in a postscript: “subtle differences between AI systems” can throw off candidates. That’s the surface version. The deeper version is that AI fluency is currently correlated with age, with time spent experimenting, with access to premium tool subscriptions, and with a specific personality profile that enjoys early adoption. None of those traits is a proxy for engineering competence. If your interview systematically advantages them, you’re measuring something real (tool fluency) and something accidental (adoption timing) at the same time, and you can’t tell which one drove the result.
The ownership question is the interview
The most interesting thing in Swizec’s piece is the frame he puts around it: “we have your phone number. When you write code with AI, it’s still your phone number.” The interview, in his design, is ultimately about whether you’d trust this person to own a production system.
That’s the right question. And it’s a question that AI in the room makes both more important and harder to assess.
More important because AI-generated code creates a specific ownership problem. The engineer who ships AI-generated code owns output they didn’t write line by line. They need to understand it well enough to debug it at 3am, to explain it in a postmortem, to refactor it when requirements change. That requires a different kind of ownership than hand-written code: not “I built this” but “I evaluated this, I understand why it works, and I know where it might break.” That’s a judgment skill, and it’s the right thing to test for.
Harder to assess because AI in the interview compresses the building phase and expands the evaluation phase. If the candidate generates a working implementation in 20 minutes using Claude Code and spends the remaining 40 minutes discussing operational concerns, the interviewer is now evaluating judgment and systems thinking rather than coding ability. Those are higher-order skills that are harder to score consistently, harder to rubric, and more susceptible to the interviewer’s own biases about what “senior thinking” sounds like.
I wrote about this in the context of how vague evaluation criteria become enforcement tools: when the criteria shift from observable (did the code compile, did the tests pass) to interpretive (does this person think like a senior engineer), the evaluation becomes more vulnerable to the interviewer’s assumptions about what seniority looks like, sounds like, and acts like. In a coding interview where AI handles the mechanical work, the entire evaluation shifts into this interpretive space. That’s where bias operates, and it’s where the evaluation needs the most structure.
The hidden evaluation: how they handle AI failure
There’s a dimension Swizec doesn’t mention explicitly but that his interview design would surface naturally: what happens when the AI generates something wrong?
In a one-hour coding interview with AI, the candidate will almost certainly encounter a moment where the AI produces plausible but incorrect output. A function that handles the happy path but misses an edge case. A database query that works for the test data but would fail at scale. An API integration that looks right but misunderstands the authentication flow.
What the candidate does in that moment is the highest-signal event in the entire interview.
Do they catch it? How quickly? Do they catch it from reading the code, or only after running it and seeing the failure? Once they catch it, do they understand why it failed? Can they fix it without asking the AI to fix it? Do they adjust their prompting strategy based on what they learned about the AI’s failure mode?
This is the automation bias problem applied to hiring. The candidate who trusts AI output without verification is demonstrating the same cognitive pattern that produces production incidents: the system said it was right, so I assumed it was right. The candidate who catches the error and understands why it happened is demonstrating the judgment skill that makes AI augmentation safe at scale.
If you’re designing an AI-enabled interview, consider seeding the codebase with a subtle issue that the AI is likely to miss or mishandle. Not as a trap, but as a diagnostic. The candidate’s response to AI failure tells you more about their production readiness than their response to AI success.
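To make the idea of a seeded diagnostic concrete, here is a minimal sketch of the kind of issue I have in mind. Everything in it is hypothetical: the function names and the batching task are my own illustration, not something from Swizec's interview. The first function is the sort of plausible output that passes a quick test on tidy data; the second is what a candidate who reads the code carefully would arrive at.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_naive(items: Sequence[T], size: int) -> List[List[T]]:
    """Plausible-looking version: correct only when len(items) is a multiple of size."""
    batches = []
    # Integer division silently drops a trailing partial batch.
    for i in range(len(items) // size):
        batches.append(list(items[i * size:(i + 1) * size]))
    return batches


def batch_checked(items: Sequence[T], size: int) -> List[List[T]]:
    """Reviewed version: keeps the trailing partial batch and rejects a bad size."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Stepping by size means the final slice simply comes out shorter.
    return [list(items[i:i + size]) for i in range(0, len(items), size)]
```

With six items and a batch size of three, the two functions agree, which is why the happy-path test passes. With seven items, the naive version loses the last one without raising anything. Whether the candidate spots that from reading, from a failing test, or not at all is the diagnostic.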
The control surface for AI-enabled hiring
Swizec’s interview design is a genuine advance. It’s closer to real work, it tests judgment over memorization, and it acknowledges that AI is a tool engineers actually use. But the evaluation layer needs deliberate structure to avoid amplifying the biases that AI introduces.
Separate tool fluency from engineering judgment. If a candidate struggles with Claude Code but demonstrates deep system understanding in the design conversation, that’s a tool adoption gap, not a competence gap. Score them separately. Penalizing someone for being six months behind on AI adoption is like penalizing someone for using Vim instead of VS Code: it tells you about their tooling preferences, not their engineering quality.
Probe past AI-generated fluency. In the behavioral portion, when a candidate describes past work, push to the third and fourth layer of detail. AI can brief someone to level two. Only hands-on experience generates level three and beyond. If the candidate can tell you what broke in production, what the debugging session looked like, what they tried that didn’t work before finding the fix, you’ve found fractal knowledge. If they can only tell you what the system does and how it’s architected, you may have found an AI-briefed summary.
Rubric the judgment, not just the output. If the highest-signal part of the interview is the operational discussion (and Swizec is right that it is), build a scoring rubric for that discussion. What questions did they ask? What risks did they identify? What tradeoffs did they name? What did they say about monitoring, failure modes, and scale? Without a rubric, “this person thinks like a senior” becomes an intuition call, and intuition calls are where the Dunning-Kruger effect and the likability tax do their work.
Watch for the AI failure moment. Observe what happens when the AI-generated code doesn’t work. The candidate who says “hmm, let me read what it actually wrote” is demonstrating the skill you’re hiring for. The candidate who immediately re-prompts without understanding the failure is demonstrating the pattern you’re hiring to prevent.
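The four points above can be sketched as a scorecard. This is a minimal illustration under my own assumptions, not a validated instrument: the dimension names, the 0 to 3 scale, and the `Scorecard` class are all hypothetical. The one design choice that matters is that tool fluency is recorded but never folded into the judgment score.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical dimensions taken from the questions above; adapt to your own process.
JUDGMENT_DIMENSIONS = (
    "questions_asked",
    "risks_identified",
    "tradeoffs_named",
    "monitoring_failure_scale",
    "response_to_ai_failure",
)
MAX_SCORE = 3  # assumed 0-3 scale per dimension


@dataclass
class Scorecard:
    tool_fluency: int  # kept separate on purpose
    judgment: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Force the interviewer to score every dimension instead of going on feel.
        missing = set(JUDGMENT_DIMENSIONS) - set(self.judgment)
        if missing:
            raise ValueError(f"unscored dimensions: {sorted(missing)}")
        for name, score in [("tool_fluency", self.tool_fluency), *self.judgment.items()]:
            if not 0 <= score <= MAX_SCORE:
                raise ValueError(f"{name} must be between 0 and {MAX_SCORE}")

    def judgment_score(self) -> float:
        """Mean of the judgment dimensions, independent of tool fluency."""
        total = sum(self.judgment[d] for d in JUDGMENT_DIMENSIONS)
        return total / len(JUDGMENT_DIMENSIONS)
```

A candidate who wrestled with the tool but reasoned well would show up as something like `Scorecard(tool_fluency=1, judgment={d: 3 for d in JUDGMENT_DIMENSIONS})`: a low fluency number sitting next to a judgment score of 3.0, which is a very different hiring conversation from a single blended "seemed senior" impression.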
Signals
You’ll know your AI-enabled interview is working when it stops selecting for the people who are best at using Claude Code and starts selecting for the people you’d trust with a pager at 3am. When the candidates you hire demonstrate fractal knowledge about the systems they build, not just fluency about the systems they’ve read about. When the evaluation rubric captures judgment as reliably as it captures output. When the interview feels like working alongside someone, not watching someone perform.
Swizec is right that the old interview, the whiteboard, the leetcode, the memorized algorithm, was already measuring the wrong thing. His replacement is better. The question now is whether the evaluation layer can keep up with what the tools have changed. AI raised the floor of what a candidate can produce in an hour. The interview’s job is to find the ceiling of what they understand about what they produced. That gap between production and understanding is where engineering lives. It’s also where hiring lives, if you know how to look for it.


