# Writing Evals

Crucible is Aether’s evaluation framework. It lets you define test cases for agent behavior — prompts with expected outcomes — and run them against real or fake agents.

```toml
# Cargo.toml
[dependencies]
crucible = "0.1"
```

An Eval pairs a prompt with assertions about what the agent should do:

```rust
use crucible::{Eval, EvalAssertion, WorkingDirectory};

let eval = Eval::new(
    "create-hello-world",
    "Create a file called hello.rs with a main function that prints 'Hello, world!'",
    WorkingDirectory::empty()?,
    vec![
        EvalAssertion::file_exists("hello.rs"),
        EvalAssertion::command_succeeds("rustc hello.rs && ./hello"),
        EvalAssertion::llm_judge(|ctx| {
            format!(
                "Does the file hello.rs contain a valid Rust main function \
                 that prints 'Hello, world!'?\n\nFiles:\n{}",
                ctx.git_diff(None).unwrap_or_default()
            )
        }),
    ],
);
```
| Constructor | What it checks |
| --- | --- |
| `file_exists(path)` | File was created |
| `file_matches(path, content)` | File has exact content |
| `command_succeeds(cmd)` | Command exits with code 0 |
| `command_exit_code(cmd, code)` | Command exits with the given code |
| `tool_call(name)` | Agent called this tool at least once |
| `tool_call_with_args(name, args)` | Agent called the tool with specific arguments |
| `tool_call_exact(name, n)` | Agent called the tool exactly n times |
| `tool_call_at_least(name, n)` | Agent called the tool at least n times |
| `tool_call_at_most(name, n)` | Agent called the tool at most n times |
| `llm_judge(fn)` | An LLM evaluates the result (pass/fail) |
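
Assertions compose: a single eval typically mixes filesystem, command, and tool-call checks. A minimal sketch, assuming the agent exposes write_file and run_command tools (those names are illustrative, not Crucible built-ins):

```rust
// A sketch combining several constructors; "write_file" and "run_command"
// are assumed tool names for illustration.
let assertions = vec![
    EvalAssertion::tool_call("write_file"),              // called at least once
    EvalAssertion::tool_call_at_most("run_command", 3),  // no more than 3 shell calls
    EvalAssertion::file_exists("hello.rs"),
    EvalAssertion::command_exit_code("./hello --unknown-flag", 1),
];
```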

The llm_judge assertion uses a second LLM to evaluate the agent’s output. The closure receives an LlmJudgeContext:

```rust
EvalAssertion::llm_judge(|ctx| {
    // ctx.working_dir      - the eval's working directory
    // ctx.original_prompt  - the prompt given to the agent
    // ctx.messages         - full conversation history
    // ctx.git_diff(commit) - git diff of the changes made
    format!(
        "Did the agent correctly implement the feature? Diff:\n{}",
        ctx.git_diff(None).unwrap_or_default()
    )
})
```
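
A judge prompt can fold in the original task so the judge does not have to guess the goal. A sketch using only the fields above; the PASS/FAIL phrasing is an assumption about how you instruct the judge, not a Crucible requirement:

```rust
// A sketch of a stricter judge prompt built from documented context fields.
EvalAssertion::llm_judge(|ctx| {
    format!(
        "Task given to the agent:\n{}\n\nChanges made:\n{}\n\n\
         Did the agent complete the task? Answer PASS or FAIL.",
        ctx.original_prompt,
        ctx.git_diff(None).unwrap_or_default()
    )
})
```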

Each eval runs in an isolated working directory:

```rust
// Empty temp directory
WorkingDirectory::empty()?

// Copy of a local directory
WorkingDirectory::local("./test-fixtures/my-project")?

// Git repo checked out at a specific commit
WorkingDirectory::git_repo(
    "https://github.com/user/repo",
    "start-commit-sha",
    "gold-commit-sha", // reference solution
    Some("subdirectory"),
)?
```
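
The git variant suits regression-style evals, since it pins both the state the agent starts from and a reference solution. A hypothetical sketch; the repository URL, commit SHAs, and test command are placeholders:

```rust
// A hypothetical bugfix eval against a pinned repo state.
let eval = Eval::new(
    "fix-range-bug",
    "Fix the failing test in tests/range.rs without changing the test",
    WorkingDirectory::git_repo(
        "https://github.com/user/repo",
        "start-commit-sha", // state the agent starts from
        "gold-commit-sha",  // reference solution for comparison
        None,               // use the repository root
    )?,
    vec![EvalAssertion::command_succeeds("cargo test")],
);
```
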
To run a set of evals, pair an agent runner with a results store:

```rust
use crucible::{EvalRunner, EvalsConfig, AetherRunner, FileSystemStore};

// Create the agent runner
let runner = AetherRunner::new(agent_config);

// Create the results store
let store = FileSystemStore::new("./eval-results");

// Configure and run
let run_id = EvalRunner::new(runner, store)
    .with_agent_prompt("You are a coding assistant.")
    .with_output_dir("./eval-output".into())
    .run_evals(evals, EvalsConfig::new(judge_llm).with_batch_size(4))
    .await?;
```
| Method | Description |
| --- | --- |
| `new(judge_llm)` | Create a config with the judge LLM |
| `with_batch_size(n)` | Run n evals concurrently |
| `with_batch_delay(duration)` | Delay between batches |
| `with_serve(bool)` | Start an HTTP server for live results |
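
These methods chain. A sketch of a fuller configuration, assuming with_batch_delay takes a std::time::Duration (the parameter type is not confirmed by the docs above):

```rust
use std::time::Duration;

// A sketch chaining the config methods listed above.
let config = EvalsConfig::new(judge_llm)
    .with_batch_size(8)                        // 8 evals in flight at once
    .with_batch_delay(Duration::from_secs(2)) // assumed Duration parameter
    .with_serve(true);                         // serve live results over HTTP
```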

Run setup code before the agent starts or before assertions run:

```rust
let eval = Eval::new(name, prompt, working_dir, assertions)
    .setup(|dir| async move {
        // Runs before the agent, e.g. to create test fixtures
        std::fs::write(dir.join("existing.txt"), "content")?;
        Ok(())
    })
    .before_assertions(|dir| async move {
        // Runs after the agent but before assertions, e.g. to build the project
        Ok(())
    });
```
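
A common use of before_assertions is compiling the project so command-based assertions see built artifacts. A sketch, assuming the closure's error type converts from std::io::Error (as the std::fs::write example above implies):

```rust
// A sketch building the project before assertions run; the error
// conversion from std::io::Error is an assumption.
let eval = Eval::new(name, prompt, working_dir, assertions)
    .before_assertions(|dir| async move {
        let status = std::process::Command::new("cargo")
            .arg("build")
            .current_dir(&dir)
            .status()?;
        if !status.success() {
            return Err(std::io::Error::other("cargo build failed").into());
        }
        Ok(())
    });
```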

Use FakeAgentRunner for deterministic testing of your eval setup:

```rust
use crucible::{AgentRunnerMessage, FakeAgentRunner};

let runner = FakeAgentRunner::new(vec![
    AgentRunnerMessage::text("I'll create the file now"),
    AgentRunnerMessage::tool_call("write_file", r#"{"path":"hello.rs","content":"..."}"#),
]);
```
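
A fake runner plugs into the same EvalRunner as a real agent, which lets you exercise assertions end to end without model calls. A sketch, assuming eval, judge_llm, and the store are constructed as in the earlier examples:

```rust
// A sketch running one eval against the fake runner; eval and judge_llm
// are assumed to come from the earlier examples.
let run_id = EvalRunner::new(runner, FileSystemStore::new("./eval-results"))
    .run_evals(vec![eval], EvalsConfig::new(judge_llm))
    .await?;
```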