Claude Code Harness Engineering

In Claude Code, harness engineering means building the control system around the coding agent. Claude can read code, edit files, run shell commands, call tools, use MCP, respond to GitHub comments, and operate semi-autonomously. The harness is what makes that behavior bounded, testable, repeatable, and reviewable.

A good harness answers these questions:

| Question | Harness mechanism |
| --- | --- |
| What should Claude know about this repo? | CLAUDE.md |
| What tasks should be repeatable? | Skills, commands, prompts |
| What must Claude never touch? | Hooks, permissions, branch protection |
| How do we know the change works? | Tests, typecheck, lint, CI |
| How do we prevent uncontrolled edits? | File guards, approval gates, worktrees |
| How do we review the output? | PR workflow, diff summaries, human review |
| How do we connect external systems safely? | MCP with scoped permissions |
| How do we measure whether this helps? | Evaluation suite, metrics, failure taxonomy |

The practical point is simple: a long prompt is not a harness. A real harness combines instructions, executable checks, permission boundaries, review gates, and metrics.

Conventional Harness Stack

A practical Claude Code harness usually has these layers:

Repository instructions
Task prompts / Skills
Tool and file permissions
Hooks
Tests and verification commands
Git workflow
CI / GitHub Actions
Human review
Evaluation and metrics

Do not rely on a single layer. Claude should be useful inside the same software delivery system that would constrain a human engineer.

Repository Harness: CLAUDE.md

CLAUDE.md is the baseline context file for project-specific operating rules.

Good contents include:

# Claude Code Instructions

## Package manager

Use pnpm. Do not use npm or yarn.

## Common commands

- Install: `pnpm install`
- Dev server: `pnpm dev`
- Lint: `pnpm lint`
- Typecheck: `pnpm typecheck`
- Unit tests: `pnpm test`
- Build: `pnpm build`

## Architecture

- App routes live in `src/app`.
- Shared UI components live in `src/components`.
- Business logic lives in `src/features`.
- API clients live in `src/lib/api`.
- Do not put API-fetching logic directly inside presentational components.

## Rules

- Make the smallest safe change.
- Do not refactor unrelated code.
- Do not add dependencies unless explicitly requested.
- Do not edit generated files.
- Do not edit `.env*` files.
- Do not weaken tests to make them pass.
- Do not suppress TypeScript errors with `any` unless justified.

## Before finishing

Report:

1. Summary of the change
2. Files changed
3. Commands run
4. Test results
5. Remaining risks

Vague instructions such as “write clean code” or “be a great engineer” are not enough. Actionable rules are better:

When editing React components:

- Preserve keyboard accessibility.
- Use semantic HTML before ARIA.
- Add loading, empty, error, and success states where relevant.
- Prefer existing components from `src/components/ui`.
- Do not create new styling abstractions unless needed.
- For layout bugs, identify the overflowing or mispositioned element instead of hiding the problem with global CSS.

Task Harness: Structured Task Packets

Claude Code tends to perform better when a task is framed like an engineering ticket: it states the problem, the expected result, the boundaries, and how to prove the fix.

Weak prompt:

Fix the login bug.

Strong prompt:

Task:
Fix the login redirect bug.

Current behavior:
After successful login, users sometimes remain on `/login`.

Expected behavior:
After successful login, users should be redirected to the original destination or `/dashboard`.

Scope:
- Inspect `src/features/auth`.
- Avoid unrelated refactors.
- Do not change public route names.
- Do not add dependencies.

Verification:
- First reproduce with an existing or new failing test.
- Then implement the minimal fix.
- Run the targeted test.
- Run typecheck.

Output:
- Root cause
- Files changed
- Tests run
- Remaining risk

The conventional pattern is to constrain scope, define expected behavior, define verification, and define output format.
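A task packet can also be kept as structured data so that missing sections are caught before the prompt is sent. The following is a minimal sketch, not part of Claude Code: the `TaskPacket` class and its method names are hypothetical, and the field names simply mirror the headings of the strong prompt above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPacket:
    """Hypothetical container for the sections of a structured task prompt."""
    task: str
    current_behavior: str
    expected_behavior: str
    scope: list = field(default_factory=list)
    verification: list = field(default_factory=list)
    output: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self):
        """Render the packet in the same layout as the strong prompt above."""
        def bullets(items):
            return "\n".join(f"- {item}" for item in items)

        parts = [
            f"Task:\n{self.task}",
            f"Current behavior:\n{self.current_behavior}",
            f"Expected behavior:\n{self.expected_behavior}",
            f"Scope:\n{bullets(self.scope)}",
            f"Verification:\n{bullets(self.verification)}",
            f"Output:\n{bullets(self.output)}",
        ]
        return "\n\n".join(parts)
```

A template like this is most useful as a pre-flight check: if `missing_sections()` is not empty, the task is probably not ready to hand to the agent.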

Verification Harness

Claude should not merely produce code. Claude should produce a verified diff.

Use a verification ladder:

format
lint
typecheck
unit tests
integration tests
e2e tests
build

For a frontend project, commands may include:

pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm playwright test
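The ladder can be sketched as a fail-fast runner that stops at the first failing rung, so cheap checks gate expensive ones. This is an illustrative sketch: `run_ladder` and `FRONTEND_LADDER` are hypothetical names, and the mapping of rungs to pnpm scripts assumes the scripts listed above exist in the project.

```python
import subprocess
import sys


def run_ladder(steps):
    """Run (name, argv) steps in order and stop at the first failure.

    Returns a list of (name, passed) for the steps that actually ran.
    """
    results = []
    for name, argv in steps:
        try:
            completed = subprocess.run(argv, capture_output=True, text=True)
            passed = completed.returncode == 0
            output = completed.stdout + completed.stderr
        except FileNotFoundError:
            # The executable is not installed or not on PATH.
            passed, output = False, f"command not found: {argv[0]}"
        results.append((name, passed))
        if not passed:
            print(f"[{name}] failed:\n{output}", file=sys.stderr)
            break  # fail fast: later rungs are skipped
    return results


# Assumed mapping from rungs to the pnpm scripts listed above.
FRONTEND_LADDER = [
    ("lint", ["pnpm", "lint"]),
    ("typecheck", ["pnpm", "typecheck"]),
    ("unit tests", ["pnpm", "test"]),
    ("build", ["pnpm", "build"]),
    ("e2e tests", ["pnpm", "playwright", "test"]),
]
```

Calling `run_ladder(FRONTEND_LADDER)` from the repository root would run the whole ladder; passing a shorter list runs a targeted subset.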

Do not always force the full suite. Use targeted verification first.

Example policy:

## Verification policy

For small frontend changes:

1. Run the most relevant unit test.
2. Run `pnpm typecheck`.
3. Run `pnpm lint`.
4. Run `pnpm build` if routing, bundling, or config changed.

For critical flows:

1. Add or update regression tests.
2. Run the affected test file.
3. Run the related integration test.
4. Run the relevant Playwright spec if available.
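The small-change policy above can be expressed as a function that picks checks from the list of changed files. This is a sketch under assumptions: `select_checks` is a hypothetical helper, `CONFIG_NAMES` is an illustrative guess at which files count as "routing, bundling, or config", and passing a path to `pnpm test` assumes the project's test runner accepts a file argument.

```python
from pathlib import PurePosixPath

# Hypothetical file names treated as bundling or config changes.
CONFIG_NAMES = {"package.json", "tsconfig.json", "next.config.js", "vite.config.ts"}


def select_checks(changed_files, targeted_test=None):
    """Return the commands for the small-change policy, in order."""
    commands = []
    if targeted_test:
        # Step 1: the most relevant unit test, chosen by the caller.
        commands.append(f"pnpm test {targeted_test}")
    # Steps 2 and 3 always run.
    commands += ["pnpm typecheck", "pnpm lint"]
    # Step 4: build only if routing, bundling, or config changed.
    names = {PurePosixPath(path).name for path in changed_files}
    touches_routes = any(path.startswith("src/app/") for path in changed_files)
    if names & CONFIG_NAMES or touches_routes:
        commands.append("pnpm build")
    return commands
```

The route check relies on the convention stated earlier that app routes live in `src/app`.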

Useful prompt:

Before changing implementation code, identify the smallest test command that reproduces the problem. If no test exists, add one. After the fix, rerun that test and then run typecheck.

This guards against a fix that looks plausible but has never been shown to work.

Hook Harness

Hooks are executable guardrails. They can inject context, block risky actions, run checks, or enforce policies outside the model’s discretion.

Practical hook categories:

| Hook | Use |
| --- | --- |
| SessionStart | Print repo rules, check environment |
| UserPromptSubmit | Add branch/status context |
| PreToolUse | Block dangerous commands or protected files |
| PostToolUse | Run formatter or lint changed files |
| FileChanged | Trigger targeted validation |
| Stop | Require a test summary before completion |
| SessionEnd | Save logs or metrics |

Block edits to:

.env
.env.local
*.pem
*.key
node_modules/
dist/
build/
coverage/
generated/
package-lock.json when using pnpm
yarn.lock when using pnpm
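The core of a file guard is a pure path check, sketched below. `is_protected` and the constant names are hypothetical; the patterns are copied from the list above. How the path reaches the script (for example, from the payload a pre-edit hook receives) and how a block is signalled depend on the hook configuration and are not shown here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

PROTECTED_NAMES = [".env", ".env.local", "*.pem", "*.key"]
PROTECTED_DIRS = {"node_modules", "dist", "build", "coverage", "generated"}
FOREIGN_LOCKFILES = {"package-lock.json", "yarn.lock"}  # the repo uses pnpm


def is_protected(path):
    """Return True if an edit to this repo-relative path should be blocked."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    name = parts[-1]
    # Anything inside a protected directory is off limits.
    if any(part in PROTECTED_DIRS for part in parts[:-1]):
        return True
    # Lockfiles that belong to other package managers.
    if name in FOREIGN_LOCKFILES:
        return True
    # Secrets and key material, matched on the file name only.
    return any(fnmatchcase(name, pattern) for pattern in PROTECTED_NAMES)
```

Keeping the check pure makes it easy to unit test separately from whatever mechanism invokes it.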

Require approval for dangerous commands:

rm -rf
sudo
chmod -R
chown -R
git push --force
git reset --hard
docker system prune
kubectl delete
terraform apply
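A command guard for this list can be sketched with the standard `shlex` module. The function names and the matching strategy are assumptions made for illustration, and the check is deliberately simple: it only inspects the first word of each sub-command, and it would miss variants such as `rm -r -f`. Treat it as a starting point, not a complete filter.

```python
import shlex

# One rule per entry in the list above: program first, then required words.
DANGEROUS_RULES = [
    ["rm", "-rf"], ["sudo"], ["chmod", "-R"], ["chown", "-R"],
    ["git", "push", "--force"], ["git", "reset", "--hard"],
    ["docker", "system", "prune"], ["kubectl", "delete"],
    ["terraform", "apply"],
]
SHELL_PUNCTUATION = set("();<>|&")


def split_commands(command):
    """Split a shell line into token lists, one per sub-command."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    segments, current = [], []
    for token in lexer:
        if token and set(token) <= SHELL_PUNCTUATION:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


def _token_matches(rule_word, token):
    if rule_word.startswith("-") and not rule_word.startswith("--"):
        # Short flags may be clustered in any order: -rf, -fr, -rfv.
        return (token.startswith("-") and not token.startswith("--")
                and set(rule_word[1:]) <= set(token[1:]))
    return token == rule_word


def _matches(rule, tokens):
    if not tokens or tokens[0] != rule[0]:
        return False
    remaining = iter(tokens[1:])  # consumed left to right, so order matters
    return all(any(_token_matches(word, tok) for tok in remaining)
               for word in rule[1:])


def needs_approval(command):
    """Return True if the command should be shown to a human first."""
    try:
        segments = split_commands(command)
    except ValueError:
        return True  # unparsable input is treated as risky
    return any(_matches(rule, seg) for seg in segments for rule in DANGEROUS_RULES)
```

Erring toward asking for approval on input the guard cannot parse keeps the failure mode safe.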

Post-edit automation can include:

prettier changed files
eslint changed files
typecheck if TS files changed
run targeted test if test file exists
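The post-edit steps above could be planned by a small function that turns a list of changed files into commands. This is a sketch with assumptions stated up front: `post_edit_commands` is a hypothetical helper, the `Foo.test.tsx` naming convention is assumed, and the commands assume Prettier and ESLint are installed as project dependencies.

```python
from pathlib import PurePosixPath

TS_SUFFIXES = {".ts", ".tsx"}
LINTABLE_SUFFIXES = TS_SUFFIXES | {".js", ".jsx"}


def post_edit_commands(changed_files, existing_files):
    """Return argv lists to run after an edit, in order."""
    commands = []
    if not changed_files:
        return commands
    # Format everything that changed; unknown file types are skipped.
    commands.append(["pnpm", "exec", "prettier", "--write",
                     "--ignore-unknown", *changed_files])
    lintable = [f for f in changed_files
                if PurePosixPath(f).suffix in LINTABLE_SUFFIXES]
    if lintable:
        commands.append(["pnpm", "exec", "eslint", *lintable])
    if any(PurePosixPath(f).suffix in TS_SUFFIXES for f in changed_files):
        commands.append(["pnpm", "typecheck"])
    for f in lintable:
        path = PurePosixPath(f)
        if path.stem.endswith(".test"):
            test_file = f  # the changed file is itself a test
        else:
            # Assumed convention: Button.tsx is covered by Button.test.tsx.
            test_file = str(path.with_name(f"{path.stem}.test{path.suffix}"))
        command = ["pnpm", "test", test_file]
        if test_file in existing_files and command not in commands:
            commands.append(command)
    return commands
```

The caller supplies `existing_files` (for example, from a directory listing) so the function itself stays free of filesystem access.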

Permission Harness

Treat Claude Code as a powerful junior engineer with shell access.

Use:

least privilege
no production secrets
no broad cloud credentials
no unrestricted database write access
approval for destructive commands
separate dev/staging/prod credentials
read-only access where possible

Prefer:

read/write access to repo
read-only access to docs
staging-only API keys
throwaway database
ephemeral branches

Avoid:

production database credentials
production deploy keys
personal SSH keys
cloud admin tokens
write access to billing/payment systems

Access is useful, but unrestricted access is poor engineering.

Git Harness

Claude Code should follow the normal Git workflow.

Recommended flow:

create branch
inspect issue
make plan
change files
run verification
show diff
commit
open PR
CI runs
human reviews
merge

Git rules:

## Git rules

- Never commit directly to `main`.
- Create a feature branch for non-trivial work.
- Keep the diff focused.
- Do not mix formatting-only changes with logic changes.
- Do not commit until relevant tests pass.
- Before committing, show:
  - changed files
  - behavior changed
  - tests run
  - risks

Prompt:

Create a new branch for this fix. Make the smallest safe change. Do not commit yet. After tests pass, show me the diff summary and ask before committing.

This keeps diffs small, focused, and easy to review.

CI and GitHub Actions Harness

Claude can assist inside GitHub workflows, but CI and humans should remain the gate.

Use Claude for:

PR review
test suggestion
bug reproduction
small implementation tasks
documentation updates
refactor proposals

Do not use Claude for:

auto-merging its own PRs
bypassing CI
approving security-sensitive changes
direct production deployment

Required CI checks can include:

lint
typecheck
unit tests
build
e2e smoke tests
dependency audit
secret scan
human approval
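The gate logic behind required checks is simple enough to state in code: a PR is mergeable only when every required check has reported success, and a check that never reported counts as a failure. The sketch below is illustrative; `merge_blockers` is a hypothetical name, and in practice this rule is usually enforced by branch protection rather than a script.

```python
REQUIRED_CHECKS = [
    "lint", "typecheck", "unit tests", "build",
    "e2e smoke tests", "dependency audit", "secret scan", "human approval",
]


def merge_blockers(check_results):
    """Return the required checks that have not passed.

    check_results maps a check name to True (passed) or False (failed).
    A missing entry is treated as not passed.
    """
    return [name for name in REQUIRED_CHECKS
            if not check_results.get(name, False)]
```

An empty return value means nothing is blocking the merge.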

Example PR prompts:

@claude review this PR for regression risk. Focus on authentication, missing tests, and unsafe assumptions. Do not modify files.
@claude implement the smallest fix for the failing test in this PR. Do not change unrelated files. Add a short explanation and leave the PR for human review.

Skill Harness

If you repeatedly ask Claude the same thing, make it a reusable workflow.

Examples:

frontend-review
write-regression-test
accessibility-audit
performance-review
release-checklist
dependency-upgrade
api-contract-review

Example skill:

---
name: frontend-review
description: Review frontend code for correctness, accessibility, performance, and maintainability.
---

# Frontend review workflow

Inspect the diff and check:

1. Rendering correctness
2. TypeScript correctness
3. Accessibility
4. Keyboard navigation
5. Responsive behavior
6. Loading, empty, error, and success states
7. Avoidable re-renders
8. Unnecessary dependencies
9. Test coverage

Return:

- Blocking issues
- Non-blocking suggestions
- Missing tests
- Risk level

Regression test skill:

---
name: write-regression-test
description: Add or update tests that reproduce a bug before fixing it.
---

# Workflow

1. Understand the reported bug.
2. Locate the smallest relevant test file.
3. Add a failing test that reproduces the bug.
4. Run the test and confirm failure.
5. Implement the minimal fix.
6. Rerun the test and confirm pass.
7. Run typecheck.
8. Summarize the root cause and verification.

This turns strong prompts into durable engineering infrastructure.

Subagent Harness

Subagents are useful when specialized reviewers should inspect the work independently.

Example:

Use separate subagents:

1. Security reviewer
2. TypeScript reviewer
3. Test reviewer
4. Frontend UX/accessibility reviewer
5. Performance reviewer

Each reviewer should inspect the diff independently and return only blocking or high-value issues. Then synthesize the findings into one final fix plan.

Useful subagent roles:

| Subagent | Checks |
| --- | --- |
| Security reviewer | auth, secrets, injection, unsafe permissions |
| TypeScript reviewer | type soundness, `any` usage, API breakage |
| Test reviewer | missing regression tests, weak assertions |
| Frontend reviewer | layout, a11y, state handling |
| Performance reviewer | unnecessary renders, bundle size, N+1 calls |

This is useful for larger PRs or high-risk changes.
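The synthesis step can be sketched as merging duplicate findings and ordering them by severity. The `Finding` shape and the two severity labels are assumptions for illustration, and deduplicating on exact message text is naive: real reviewer output would need fuzzier matching.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"blocking": 0, "high": 1}


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one reviewer finding."""
    reviewer: str
    file: str
    severity: str  # "blocking" or "high"
    message: str


def synthesize(findings):
    """Merge duplicate findings and list blocking ones first.

    Returns tuples of (severity, file, message, reviewers).
    """
    merged = {}
    for f in findings:
        key = (f.file, f.message)
        entry = merged.setdefault(key, {"severity": f.severity, "reviewers": []})
        # Keep the most severe rating any reviewer gave this issue.
        if SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = f.severity
        if f.reviewer not in entry["reviewers"]:
            entry["reviewers"].append(f.reviewer)
    plan = [(e["severity"], file, message, e["reviewers"])
            for (file, message), e in merged.items()]
    return sorted(plan, key=lambda item: (SEVERITY_ORDER[item[0]], item[1]))
```

An issue raised independently by two reviewers appears once in the plan, with both reviewers listed.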

MCP Harness

MCP is useful when Claude needs access to external systems such as:

GitHub
Linear / Jira
Sentry
Datadog
Postgres
Figma
internal docs
design systems
CI logs

Good MCP design:

read-only production logs
read-only issue tracker access
read-only design file access
staging database write access only
narrow tools instead of broad shell access
audit logs
explicit approval for mutations

Bad MCP design:

full production database write access
cloud admin credentials
unrestricted deployment access
unrestricted filesystem access outside repo
raw secret-store access

MCP should expose specific capabilities, not unlimited authority.
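The good and bad designs above can be summarized as an allowlist policy: unknown tools are denied, production mutations are denied, and other mutations need explicit approval. The sketch below is not an MCP API. The tool names, `ToolRule`, and `authorize` are all hypothetical and only illustrate the decision logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRule:
    environment: str  # "staging" or "production"
    mutates: bool


# Hypothetical tool names, chosen only for illustration.
TOOL_RULES = {
    "logs.search": ToolRule("production", mutates=False),
    "issues.read": ToolRule("production", mutates=False),
    "db.staging_write": ToolRule("staging", mutates=True),
    "db.production_write": ToolRule("production", mutates=True),
}


def authorize(tool, approved=False):
    """Return "allow", "needs_approval", or "deny" for a tool call."""
    rule = TOOL_RULES.get(tool)
    if rule is None:
        return "deny"  # not on the allowlist
    if rule.mutates and rule.environment == "production":
        return "deny"  # no production writes, even with approval
    if rule.mutates and not approved:
        return "needs_approval"
    return "allow"
```

Denying by default means a newly added tool has no authority until someone writes a rule for it.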

Evaluation Harness

Do not judge Claude Code by vibes. Create an evaluation suite from real repository history.

Collect 20 to 50 past tasks:

bug fixes
test additions
small features
refactors
accessibility fixes
performance fixes
dependency updates
docs updates

For each task, store:

starting commit
prompt
expected behavior
acceptance test
expected touched files
known pitfalls
review rubric
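One way to store these fields is a small record type, with a helper that compares the files a run touched against the expected set. This is a sketch: `EvalTask` and `unexpected_files` are hypothetical names, and the field names follow the list above.

```python
from dataclasses import dataclass, field


@dataclass
class EvalTask:
    """Hypothetical record for one evaluation task."""
    starting_commit: str
    prompt: str
    expected_behavior: str
    acceptance_test: str
    expected_touched_files: list = field(default_factory=list)
    known_pitfalls: list = field(default_factory=list)
    review_rubric: list = field(default_factory=list)


def unexpected_files(task, touched_files):
    """Return files the run touched that the task did not expect."""
    expected = set(task.expected_touched_files)
    return sorted(f for f in set(touched_files) if f not in expected)
```

A non-empty result is a cheap signal of scope creep and can be checked before any human looks at the diff.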

Score each run:

| Category | Score |
| --- | --- |
| Correctness | 0-5 |
| Minimal diff | 0-5 |
| Test quality | 0-5 |
| Maintains conventions | 0-5 |
| Security | 0-5 |
| Review effort saved | 0-5 |
| Cost | 0-5 |
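Scores from the rubric above can be validated and aggregated with a few lines of code. This sketch assumes that a higher score is better in every category, including Cost (so 5 would mean cheapest); the source does not state the direction. `EvalRun` and `category_means` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean

CATEGORIES = [
    "Correctness", "Minimal diff", "Test quality", "Maintains conventions",
    "Security", "Review effort saved", "Cost",
]


@dataclass
class EvalRun:
    task_id: str
    scores: dict  # category name -> integer score from 0 to 5

    def validate(self):
        """Raise ValueError if a category is missing or out of range."""
        missing = [c for c in CATEGORIES if c not in self.scores]
        bad = [c for c, s in self.scores.items() if not 0 <= s <= 5]
        if missing or bad:
            raise ValueError(f"missing={missing} out_of_range={bad}")

    def total(self):
        """Sum of all category scores, out of 35."""
        self.validate()
        return sum(self.scores[c] for c in CATEGORIES)


def category_means(runs):
    """Average score per category across runs, to expose weak areas."""
    return {c: mean(run.scores[c] for run in runs) for c in CATEGORIES}
```

Per-category means are usually more informative than the total, because a high total can hide a consistently weak category such as test quality.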

The point is to measure whether Claude saves time or creates review debt.

Frontend-specific Harness

For frontend work, make the harness strict.

## Frontend engineering rules

- Use semantic HTML first.
- Preserve keyboard navigation.
- Do not use ARIA to compensate for bad HTML unless necessary.
- Check mobile, tablet, and desktop layouts.
- Avoid layout shift.
- Avoid global CSS changes unless justified.
- Prefer existing design-system components.
- Do not introduce new state libraries.
- Do not add client components unnecessarily.
- Handle loading, empty, error, and success states.
- Avoid `useEffect` for derived state.
- Avoid suppressing hydration errors.

Verification commands:

pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm playwright test

UI bug prompt:

Task:
Fix the mobile layout bug.

Current behavior:
On iPhone Safari, the page has horizontal overflow.

Expected behavior:
No horizontal scroll at any viewport width.

Constraints:
- Do not use global `overflow-x: hidden` as the primary fix.
- Find the actual overflowing element.
- Keep the diff minimal.
- Do not refactor unrelated layout code.

Verification:
- Identify root cause.
- Test at mobile viewport widths.
- Run lint and build.

Output:
- Root cause
- Files changed
- Verification
- Remaining risk

This gives Claude far more to work with than “fix the responsive issue.”

Workflow Templates

Bug Fix

1. Reproduce the bug.
2. Add or identify a failing test.
3. Make the smallest fix.
4. Rerun the failing test.
5. Run broader affected checks.
6. Summarize root cause and risk.

Prompt:

Fix this bug using a failing-test-first workflow. Do not change implementation until you have reproduced the issue with a test or a clear command. Keep the diff minimal.

Refactor

1. Define behavior that must not change.
2. Add characterization tests if missing.
3. Refactor in small commits.
4. Run tests after each phase.
5. Avoid public API changes.

Prompt:

Refactor this module without changing behavior. First identify the public API and existing tests. Add characterization tests if needed. Then refactor in small steps and run tests.

PR Review

1. Inspect diff.
2. Classify risk.
3. Check tests.
4. Check security.
5. Check maintainability.
6. Return blocking issues first.

Prompt:

Review this PR as a senior engineer. Focus on correctness, test gaps, security, and project convention violations. Return blocking issues first. Do not comment on style unless it affects maintainability.

Feature Implementation

1. Restate requirements.
2. Identify files likely involved.
3. Create implementation plan.
4. Implement minimal vertical slice.
5. Add tests.
6. Run verification.
7. Summarize.

Prompt:

Implement this feature as a minimal vertical slice. Do not introduce abstractions until needed. Add tests for the main behavior and one edge case.

Dependency Upgrade

1. Read changelog.
2. Identify breaking changes.
3. Upgrade package.
4. Fix compile/test failures.
5. Run focused and full checks.
6. Document migration notes.

Prompt:

Upgrade this dependency safely. Read the migration notes first. Make the smallest changes needed. Do not upgrade unrelated packages.

Anti-patterns

Avoid giant vague prompts:

Improve this repo.

Use focused prompts:

Inspect `src/features/payment` and identify the top 5 concrete maintainability risks. Do not edit files yet.

Avoid tasks without test requirements:

Fix it and tell me when done.

Use:

Fix it and report exact verification commands and results.

Forbid unrelated refactors:

Do not refactor unrelated code. If you see unrelated issues, list them separately instead of changing them.

Watch for correctness weakening:

commenting out failing tests
loosening assertions
adding `as any`
disabling lint rules
removing type checks
changing CI config to pass
adding broad try/catch blocks
swallowing errors

The harness should explicitly forbid these.
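Some of these patterns can be detected mechanically by scanning the added lines of a unified diff. The sketch below is a heuristic with a hypothetical function name and an illustrative, incomplete pattern set. It will produce false positives and cannot see semantic weakening such as a loosened assertion.

```python
import re

WEAKENING_PATTERNS = {
    "as any cast": re.compile(r"\bas any\b"),
    "disabled lint rule": re.compile(r"eslint-disable"),
    "suppressed type check": re.compile(r"@ts-(ignore|nocheck|expect-error)"),
    "skipped test": re.compile(r"\b(it|test|describe)\.skip\(|\bx(it|describe)\("),
    "empty catch block": re.compile(r"catch\s*(\([^)]*\))?\s*\{\s*\}"),
}


def find_weakening(diff_text):
    """Return (file, label, line) for suspicious added lines in a diff."""
    findings = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Header naming the new version of the file, e.g. "+++ b/src/a.ts".
            current_file = line[4:].removeprefix("b/")
            continue
        if not line.startswith("+"):
            continue  # only added lines can introduce a weakening
        added = line[1:]
        for label, pattern in WEAKENING_PATTERNS.items():
            if pattern.search(added):
                findings.append((current_file, label, added.strip()))
    return findings
```

A check like this fits best as a CI step or post-edit hook that flags the diff for human attention, not one that rejects it outright.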

Implementation Plan

Set up a Claude Code harness in phases.

Phase 1: Basic Repo Harness

CLAUDE.md
standard task prompt templates
verification ladder
Git rules
completion format

Phase 2: Safety Harness

protected file list
dangerous command list
approval rules
secret scanning
branch protection

Phase 3: Test Harness

targeted test commands
unit test conventions
e2e smoke tests
frontend accessibility checks
build verification

Phase 4: Workflow Harness

Create Skills for:

bug fix
frontend review
write regression test
release checklist
dependency upgrade
accessibility audit

Phase 5: CI Harness

Claude GitHub Action
required checks
no auto-merge
human review
PR labels for AI-generated changes

Phase 6: Evaluation Harness

Track:

acceptance rate
review comments
test failures
escaped bugs
average diff size
cost per PR
time saved
failure patterns
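Several of these metrics reduce to simple aggregates over per-PR records. The record shape below is an assumption made for illustration (`PrRecord`, its fields, and the currency unit are hypothetical). Metrics such as time saved and escaped bugs need data from outside the PR and are left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrRecord:
    """Hypothetical per-PR record for AI-assisted changes."""
    accepted: bool
    review_comments: int
    diff_lines: int
    cost_usd: float
    failure_pattern: Optional[str] = None  # label from the failure taxonomy


def summarize(records):
    """Aggregate basic harness metrics over a list of PrRecord."""
    n = len(records)
    if n == 0:
        return {}
    patterns = Counter(r.failure_pattern for r in records if r.failure_pattern)
    return {
        "acceptance_rate": sum(r.accepted for r in records) / n,
        "avg_review_comments": sum(r.review_comments for r in records) / n,
        "avg_diff_lines": sum(r.diff_lines for r in records) / n,
        "cost_per_pr": sum(r.cost_usd for r in records) / n,
        "top_failure_patterns": patterns.most_common(3),
    }
```

Tracking the most common failure patterns over time shows which harness layer to strengthen next.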

Strong Default Harness

Use this as a starting point:

# Claude Code Harness

## Operating principles

- Make minimal, focused changes.
- Prefer correctness over cleverness.
- Do not refactor unrelated code.
- Do not add dependencies without justification.
- Do not edit generated files.
- Do not edit secrets.
- Do not weaken tests, types, lint, auth, or validation.

## Commands

- Install: `pnpm install`
- Lint: `pnpm lint`
- Typecheck: `pnpm typecheck`
- Test: `pnpm test`
- Build: `pnpm build`

## Frontend rules

- Use semantic HTML.
- Preserve keyboard accessibility.
- Check responsive layouts.
- Handle loading, empty, error, and success states.
- Reuse existing components.
- Avoid unnecessary client-side state.
- Avoid unnecessary `useEffect`.
- Do not suppress hydration errors without explanation.

## Bug-fix workflow

1. Reproduce the bug.
2. Add or identify a failing test.
3. Make the smallest fix.
4. Rerun the failing test.
5. Run broader affected checks.
6. Summarize root cause.

## Review workflow

Check:

- correctness
- missing tests
- security
- accessibility
- performance
- maintainability
- project convention violations

Return blocking issues first.

## Git rules

- Never commit directly to main.
- Use feature branches.
- Keep diffs focused.
- Do not commit without verification.
- Do not push without user approval.

## Completion format

Return:

1. Summary
2. Root cause, if applicable
3. Files changed
4. Verification commands and results
5. Remaining risks

Bottom Line

The practical way to do Claude Code harness engineering is not to write longer prompts. It is to build a controlled development system around Claude:

CLAUDE.md
+ task templates
+ Skills
+ hooks
+ file/command guards
+ tests
+ CI
+ Git discipline
+ human review
+ evaluation metrics

Claude Code proposes and edits. The harness constrains and verifies.