# Writing Evals
Crucible is Aether’s evaluation framework. It lets you define test cases for agent behavior — prompts with expected outcomes — and run them against real or fake agents.
Add the crate to your `Cargo.toml`:

```toml
[dependencies]
crucible = "0.1"
```

## Defining an eval

An `Eval` pairs a prompt with assertions about what the agent should do:
```rust
use crucible::{Eval, EvalAssertion, WorkingDirectory};

let eval = Eval::new(
    "create-hello-world",
    "Create a file called hello.rs with a main function that prints 'Hello, world!'",
    WorkingDirectory::empty()?,
    vec![
        EvalAssertion::file_exists("hello.rs"),
        EvalAssertion::command_succeeds("rustc hello.rs && ./hello"),
        EvalAssertion::llm_judge(|ctx| {
            format!(
                "Does the file hello.rs contain a valid Rust main function \
                 that prints 'Hello, world!'?\n\nFiles:\n{}",
                ctx.git_diff(None).unwrap_or_default()
            )
        }),
    ],
);
```

## Assertions
| Constructor | What it checks |
|---|---|
| `file_exists(path)` | File was created |
| `file_matches(path, content)` | File has exact content |
| `command_succeeds(cmd)` | Command exits with code 0 |
| `command_exit_code(cmd, code)` | Command exits with specific code |
| `tool_call(name)` | Agent called this tool at least once |
| `tool_call_with_args(name, args)` | Agent called tool with specific arguments |
| `tool_call_exact(name, n)` | Agent called tool exactly `n` times |
| `tool_call_at_least(name, n)` | Agent called tool at least `n` times |
| `tool_call_at_most(name, n)` | Agent called tool at most `n` times |
| `llm_judge(fn)` | An LLM evaluates the result (pass/fail) |
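The count-based constructors compose naturally when you want to pin down how the agent used its tools. A small sketch (the `write_file` and `run_command` tool names here are hypothetical; substitute whatever tools your agent actually exposes):

```rust
use crucible::EvalAssertion;

// Hypothetical tool names, for illustration only.
let tool_assertions = vec![
    // The file must have been written at least once...
    EvalAssertion::tool_call_at_least("write_file", 1),
    // ...but not endlessly rewritten...
    EvalAssertion::tool_call_at_most("write_file", 3),
    // ...and the agent must never have shelled out.
    EvalAssertion::tool_call_exact("run_command", 0),
];
```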
## LLM judge

The `llm_judge` assertion uses a second LLM to evaluate the agent’s output. The closure receives an `LlmJudgeContext`:
```rust
EvalAssertion::llm_judge(|ctx| {
    // ctx.working_dir — the eval's working directory
    // ctx.original_prompt — the prompt given to the agent
    // ctx.messages — full conversation history
    // ctx.git_diff(commit) — git diff of changes made

    format!(
        "Did the agent correctly implement the feature? Diff:\n{}",
        ctx.git_diff(None).unwrap_or_default()
    )
})
```

## Working directories
Each eval runs in an isolated working directory:
```rust
// Empty temp directory
WorkingDirectory::empty()?

// Copy of a local directory
WorkingDirectory::local("./test-fixtures/my-project")?

// Git repo checked out at a specific commit
WorkingDirectory::git_repo(
    "https://github.com/user/repo",
    "start-commit-sha",
    "gold-commit-sha", // reference solution
    Some("subdirectory"),
)?
```

## Running evals
```rust
use crucible::{EvalRunner, EvalsConfig, AetherRunner, FileSystemStore};

// Create the agent runner
let runner = AetherRunner::new(agent_config);

// Create the results store
let store = FileSystemStore::new("./eval-results");

// Configure and run
let run_id = EvalRunner::new(runner, store)
    .with_agent_prompt("You are a coding assistant.")
    .with_output_dir("./eval-output".into())
    .run_evals(
        evals,
        EvalsConfig::new(judge_llm).with_batch_size(4),
    )
    .await?;
```

## EvalsConfig
| Method | Description |
|---|---|
| `new(judge_llm)` | Create config with the judge LLM |
| `with_batch_size(n)` | Run `n` evals concurrently |
| `with_batch_delay(duration)` | Delay between batches |
| `with_serve(bool)` | Start HTTP server for live results |
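Put together, a config that throttles concurrency and serves live results might look like the sketch below. It assumes `judge_llm` is an already-constructed judge model handle and that `with_batch_delay` takes a `std::time::Duration`:

```rust
use std::time::Duration;

use crucible::EvalsConfig;

// Sketch: run 4 evals at a time, pause 5 seconds between batches,
// and enable the live-results HTTP server.
let config = EvalsConfig::new(judge_llm)
    .with_batch_size(4)
    .with_batch_delay(Duration::from_secs(5))
    .with_serve(true);
```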
## Setup hooks

Run setup code before the agent starts or before assertions run:
```rust
let eval = Eval::new(name, prompt, working_dir, assertions)
    .setup(|dir| async move {
        // Run before the agent — e.g., create test fixtures
        std::fs::write(dir.join("existing.txt"), "content")?;
        Ok(())
    })
    .before_assertions(|dir| async move {
        // Run after the agent, before assertions — e.g., build the project
        Ok(())
    });
```

## Testing without an LLM
Use `FakeAgentRunner` for deterministic testing of your eval setup:
```rust
use crucible::{AgentRunnerMessage, FakeAgentRunner};

let runner = FakeAgentRunner::new(vec![
    AgentRunnerMessage::text("I'll create the file now"),
    AgentRunnerMessage::tool_call(
        "write_file",
        r#"{"path":"hello.rs","content":"..."}"#,
    ),
]);
```
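The scripted runner then slots in where `AetherRunner` otherwise would. A minimal sketch, assuming `FakeAgentRunner` satisfies the same runner interface that `EvalRunner` expects:

```rust
// Drive the same pipeline with the fake runner: setup hooks and
// assertions execute exactly as they would against a real agent.
let store = FileSystemStore::new("./eval-results");
let run_id = EvalRunner::new(runner, store)
    .run_evals(evals, EvalsConfig::new(judge_llm))
    .await?;
```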