# Writing Evals

Crucible is Aether’s evaluation framework. It lets you define test cases for agent behavior — prompts with expected outcomes — and run them against real or fake agents.

```toml
# Cargo.toml
[dependencies]
crucible = "0.1"
```

An Eval pairs a prompt with assertions about what the agent should do:

```rust
use crucible::{Eval, EvalAssertion, WorkingDirectory};

let eval = Eval::new(
    "create-hello-world",
    "Create a file called hello.rs with a main function that prints 'Hello, world!'",
    WorkingDirectory::empty()?,
    vec![
        EvalAssertion::file_exists("hello.rs"),
        EvalAssertion::command_succeeds("rustc hello.rs && ./hello"),
        EvalAssertion::llm_judge(|ctx| {
            format!(
                "Does the file hello.rs contain a valid Rust main function \
                 that prints 'Hello, world!'?\n\nFiles:\n{}",
                ctx.git_diff(None).unwrap_or_default()
            )
        }),
    ],
);
```
| Constructor | What it checks |
| --- | --- |
| `file_exists(path)` | File was created |
| `file_matches(path, content)` | File has exact content |
| `command_succeeds(cmd)` | Command exits with code 0 |
| `command_exit_code(cmd, code)` | Command exits with the given code |
| `tool_call(name)` | Agent called this tool at least once |
| `tool_call_with_args(name, args)` | Agent called the tool with specific arguments |
| `tool_call_exact(name, n)` | Agent called the tool exactly n times |
| `tool_call_at_least(name, n)` | Agent called the tool at least n times |
| `tool_call_at_most(name, n)` | Agent called the tool at most n times |
| `llm_judge(fn)` | An LLM evaluates the result (pass/fail) |
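
Assertions compose: a single eval typically mixes filesystem, command, and tool-call checks. A minimal sketch, assuming the agent exposes write_file and run_command tools (those names are illustrative, not Crucible built-ins):

```rust
// A sketch combining several constructors; "write_file" and "run_command"
// are assumed tool names for illustration.
let assertions = vec![
    EvalAssertion::tool_call("write_file"),              // called at least once
    EvalAssertion::tool_call_at_most("run_command", 3),  // no more than 3 shell calls
    EvalAssertion::file_exists("hello.rs"),
    EvalAssertion::command_exit_code("./hello --unknown-flag", 1),
];
```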

The llm_judge assertion uses a second LLM to evaluate the agent’s output. The closure receives an LlmJudgeContext:

```rust
EvalAssertion::llm_judge(|ctx| {
    // ctx.working_dir      - the eval's working directory
    // ctx.original_prompt  - the prompt given to the agent
    // ctx.messages         - full conversation history
    // ctx.git_diff(commit) - git diff of the changes made
    format!(
        "Did the agent correctly implement the feature? Diff:\n{}",
        ctx.git_diff(None).unwrap_or_default()
    )
})
```
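
A judge prompt can fold in the original task so the judge does not have to guess the goal. A sketch using only the fields above; the PASS/FAIL phrasing is an assumption about how you instruct the judge, not a Crucible requirement:

```rust
// A sketch of a stricter judge prompt built from documented context fields.
EvalAssertion::llm_judge(|ctx| {
    format!(
        "Task given to the agent:\n{}\n\nChanges made:\n{}\n\n\
         Did the agent complete the task? Answer PASS or FAIL.",
        ctx.original_prompt,
        ctx.git_diff(None).unwrap_or_default()
    )
})
```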

Each eval runs in an isolated working directory:

```rust
// Empty temp directory
WorkingDirectory::empty()?

// Copy of a local directory
WorkingDirectory::local("./test-fixtures/my-project")?

// Git repo checked out at a specific commit
WorkingDirectory::git_repo(
    "https://github.com/user/repo",
    "start-commit-sha",
    "gold-commit-sha", // reference solution
    Some("subdirectory"),
)?
```
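
The git variant suits regression-style evals, since it pins both the state the agent starts from and a reference solution. A hypothetical sketch; the repository URL, commit SHAs, and test command are placeholders:

```rust
// A hypothetical bugfix eval against a pinned repo state.
let eval = Eval::new(
    "fix-range-bug",
    "Fix the failing test in tests/range.rs without changing the test",
    WorkingDirectory::git_repo(
        "https://github.com/user/repo",
        "start-commit-sha", // state the agent starts from
        "gold-commit-sha",  // reference solution for comparison
        None,               // use the repository root
    )?,
    vec![EvalAssertion::command_succeeds("cargo test")],
);
```
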
To run a set of evals, pair an agent runner with a results store:

```rust
use crucible::{EvalRunner, EvalsConfig, AetherRunner, FileSystemStore};

// Create the agent runner
let runner = AetherRunner::new(agent_config);

// Create the results store
let store = FileSystemStore::new("./eval-results");

// Configure and run
let run_id = EvalRunner::new(runner, store)
    .with_agent_prompt("You are a coding assistant.")
    .with_output_dir("./eval-output".into())
    .run_evals(evals, EvalsConfig::new(judge_llm).with_batch_size(4))
    .await?;
```
| Method | Description |
| --- | --- |
| `new(judge_llm)` | Create a config with the judge LLM |
| `with_batch_size(n)` | Run n evals concurrently |
| `with_batch_delay(duration)` | Delay between batches |
| `with_serve(bool)` | Start an HTTP server for live results |
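
These methods chain. A sketch of a fuller configuration, assuming with_batch_delay takes a std::time::Duration (the parameter type is not confirmed by the docs above):

```rust
use std::time::Duration;

// A sketch chaining the config methods listed above.
let config = EvalsConfig::new(judge_llm)
    .with_batch_size(8)                        // 8 evals in flight at once
    .with_batch_delay(Duration::from_secs(2)) // assumed Duration parameter
    .with_serve(true);                         // serve live results over HTTP
```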

Run setup code before the agent starts or before assertions run:

```rust
let eval = Eval::new(name, prompt, working_dir, assertions)
    .setup(|dir| async move {
        // Runs before the agent, e.g. to create test fixtures
        std::fs::write(dir.join("existing.txt"), "content")?;
        Ok(())
    })
    .before_assertions(|dir| async move {
        // Runs after the agent but before assertions, e.g. to build the project
        Ok(())
    });
```
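
A common use of before_assertions is compiling the project so command-based assertions see built artifacts. A sketch, assuming the closure's error type converts from std::io::Error (as the std::fs::write example above implies):

```rust
// A sketch building the project before assertions run; the error
// conversion from std::io::Error is an assumption.
let eval = Eval::new(name, prompt, working_dir, assertions)
    .before_assertions(|dir| async move {
        let status = std::process::Command::new("cargo")
            .arg("build")
            .current_dir(&dir)
            .status()?;
        if !status.success() {
            return Err(std::io::Error::other("cargo build failed").into());
        }
        Ok(())
    });
```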

Use FakeAgentRunner for deterministic testing of your eval setup:

```rust
use crucible::{AgentRunnerMessage, FakeAgentRunner};

let runner = FakeAgentRunner::new(vec![
    AgentRunnerMessage::text("I'll create the file now"),
    AgentRunnerMessage::tool_call("write_file", r#"{"path":"hello.rs","content":"..."}"#),
]);
```
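
A fake runner plugs into the same EvalRunner as a real agent, which lets you exercise assertions end to end without model calls. A sketch, assuming eval, judge_llm, and the store are constructed as in the earlier examples:

```rust
// A sketch running one eval against the fake runner; eval and judge_llm
// are assumed to come from the earlier examples.
let run_id = EvalRunner::new(runner, FileSystemStore::new("./eval-results"))
    .run_evals(vec![eval], EvalsConfig::new(judge_llm))
    .await?;
```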