Tune the fiddle, or the fingers?
In Swedish folklore, Näcken sits by the water playing his fiddle. When someone comes to learn, he asks: should I tune the fiddle to your fingers, or your fingers to the fiddle? The wrong answer costs you your hands.
With LLM agents and CLI tools, the same question comes up. The difference is you can tune both, and you probably should.
This doesn’t apply to grep or curl or git. Models already know those tools deeply from training data. Nobody needs a skill file that explains what grep -r does. The interesting case is the tool that isn’t in the training set: an internal CLI, a new project, something your team wrote last quarter. For those, the model starts from near zero, and both the tool’s interface and the instructions matter.
A tool with clear error messages needs less prompting. A tool with cryptic output forces a longer skill file to compensate. The knowledge lives somewhere: in the interface or in the instructions. You pick.
Plastic surface, stable core
I think of the CLI as two layers. The core is the logic: what the tool does. Around it is the contract surface: flags, output format, error messages. The core changes rarely. The contract surface should be moldable.
Add --json. Include the failing input in error messages. Return distinct exit codes for “not found” vs. “permission denied” vs. “bad argument.” None of this changes what the tool does. It changes how it communicates.
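As a sketch of what that contract surface looks like, here is a tiny hypothetical CLI called `lookup`. The tool, its flags, and its exit codes are all illustrative; the point is that `--json`, errors that echo the failing input, and distinct exit codes live entirely in the surface layer:

```python
import argparse
import json
import sys

# Illustrative exit codes: distinct failures get distinct numbers.
EXIT_OK = 0
EXIT_NOT_FOUND = 2
EXIT_BAD_ARG = 4

RECORDS = {"alice": "admin", "bob": "viewer"}  # stand-in for the core logic

def run(argv):
    parser = argparse.ArgumentParser(prog="lookup")
    parser.add_argument("key")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable output")
    args = parser.parse_args(argv)

    if not args.key.isidentifier():
        # The error message carries the failing input, not just "bad argument".
        print(f"error: invalid key {args.key!r}", file=sys.stderr)
        return EXIT_BAD_ARG
    if args.key not in RECORDS:
        print(f"error: key {args.key!r} not found", file=sys.stderr)
        return EXIT_NOT_FOUND

    role = RECORDS[args.key]
    if args.json:
        print(json.dumps({"key": args.key, "role": role}))
    else:
        print(f"{args.key}: {role}")
    return EXIT_OK
```

Everything an agent needs to branch on, the exit code and the structured output, is added without touching what the tool does.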
The skill depends on that contract surface, but also on which model reads it. A precise model needs a different skill than one that improvises. The skill is an adapter between the tool’s interface and the model’s behavior. You can’t evaluate either in isolation.
Finding the friction
I’ve been experimenting with an approach that uses the agent as a probe.
Start with a compiled binary and a test suite. The agent gets the binary, a skill file, and tasks to solve. No source code.
Phase 1: Blind run. Agent in a clean sandbox with just the binary and the skill. Run the tests. Record what passes, what fails, full traces.
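A blind run needs little more than a scratch directory, the binary, the skill file, and a recorder. This harness is a sketch, not the actual setup; the function name and trace format are made up for illustration:

```python
import os
import shutil
import subprocess
import tempfile

def blind_run(binary, skill_text, tasks, timeout=60):
    """Run each task against the binary in a clean temp dir; record full traces.
    `tasks` is a list of argv lists. Hypothetical harness, not a real one."""
    traces = []
    workdir = tempfile.mkdtemp()
    try:
        # The skill file is the only documentation present in the sandbox.
        with open(os.path.join(workdir, "SKILL.md"), "w") as f:
            f.write(skill_text)
        for argv in tasks:
            proc = subprocess.run([binary, *argv], cwd=workdir,
                                  capture_output=True, text=True,
                                  timeout=timeout)
            traces.append({"argv": argv, "exit": proc.returncode,
                           "stdout": proc.stdout, "stderr": proc.stderr})
    finally:
        shutil.rmtree(workdir)  # clean sandbox: nothing survives between runs
    return traces
```

The traces, exit codes and all, are exactly what Phase 2 hands back to the agent.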
Phase 2: Self-audit. Show the agent its traces. It identifies where it got stuck, where it misread output, where it lacked information. Out comes a friction report.
Phase 3: Source reveal. Give the agent the source code. It reviews the codebase and classifies each friction point:
- Tool gap: the CLI can’t express or report what was needed. Missing flag, ambiguous error, no structured output. Fix the tool.
- Skill gap: the CLI can do it, but the skill didn’t mention it or described it poorly. Fix the skill.
- Model gap: both surfaces had the information. The agent didn’t use it. Nothing to fix.
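The three-way classification is easy to carry around as data. A minimal sketch, with names (`Gap`, `FrictionPoint`, `actionable`) invented for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class Gap(Enum):
    TOOL = "tool"    # CLI can't express/report what was needed -> fix the tool
    SKILL = "skill"  # CLI can do it, skill didn't say so       -> fix the skill
    MODEL = "model"  # info was there, agent didn't use it      -> nothing to fix

@dataclass(frozen=True)
class FrictionPoint:
    task: str      # which task surfaced it
    symptom: str   # what went wrong in the trace
    gap: Gap       # classification after the source reveal

def actionable(points):
    """Keep only friction points that imply a change to the tool or the skill."""
    return [p for p in points if p.gap is not Gap.MODEL]
```

Filtering out model gaps up front keeps later rounds from chasing failures nothing can fix.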
Phase 4: Accumulate. Don’t change the CLI after one round. Run several loops first: different models, different skill variants, different task sets. Let the friction reports pile up.
A single run produces specific complaints: the agent got stuck on one edge case, and the obvious fix is a narrow flag that solves exactly that. Implement it immediately and you get an interface shaped by individual failures rather than by design, full of ad-hoc features that only make sense in the context of one bad trace.
With several rounds, patterns show up. Three models struggle with the same ambiguous exit code. Two task sets need structured errors for the same failure class. Those are real tool gaps. Fix those. Iterate on skill gaps between rounds. Tool changes wait.
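The pattern check across rounds can be as simple as counting how many independent runs hit the same tool gap. A sketch, assuming friction points are reduced to `(gap_kind, fingerprint)` pairs and that a gap seen in two or more runs counts as recurring; both assumptions are illustrative:

```python
from collections import Counter

def recurring_tool_gaps(reports, min_runs=2):
    """reports: one friction report per run, each a list of
    (gap_kind, fingerprint) tuples, e.g. ("tool", "ambiguous exit code").
    A tool gap is 'real' only when it recurs across independent runs."""
    counts = Counter()
    for report in reports:
        # Dedupe within a run: one run complaining twice is still one run.
        seen = {fp for kind, fp in report if kind == "tool"}
        counts.update(seen)
    return sorted(fp for fp, n in counts.items() if n >= min_runs)
```

One-off complaints fall below the threshold and never make it into a tool change.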
Phase 5: Patch and re-run. Apply the accumulated tool fixes, update the skill, repeat from Phase 1.
```
repeat:
    repeat N times (varying models, skill variants, task sets):
        run agent against tasks (blind, sandboxed)
        agent audits its own traces -> friction report
        reveal source -> agent classifies each friction point
        iterate on skill file between rounds
    review accumulated friction reports for recurring tool gaps
    apply tool fixes only where patterns are clear
    measure improvement
until pass rate stops climbing
```
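The outer loop's stopping rule can be sketched in a few lines. Here `run_round` is a hypothetical callback that performs one full pass (Phases 1 through 5) and returns the resulting pass rate; the epsilon threshold is an assumption, not something the loop above prescribes:

```python
def tune(run_round, epsilon=0.01, max_iters=10):
    """Repeat full rounds until the pass rate stops climbing.
    run_round() -> pass rate in [0, 1]; hypothetical harness callback."""
    best = run_round()
    for _ in range(max_iters - 1):
        rate = run_round()
        if rate - best <= epsilon:
            break  # improvement stalled: stop patching
        best = rate
    return best
```

The `max_iters` cap is just a guard against a pass rate that keeps inching up forever.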
The point of all this is separating “the tool needs work” from “the instructions need work” from “the model can’t do this.” In practice those get conflated. Someone watches an agent fail and blames the model when the real problem is a missing --json flag or a skill that forgot to mention --dry-run.