
Overview

Testing is a crucial part of the development process: it verifies that your crew performs as expected. crewAI provides built-in testing capabilities that let you run your crew repeatedly and evaluate its performance.

When to Use Testing

  • Before promoting a crew to production.
  • After changing prompts, tools, or model configurations.
  • When benchmarking quality/cost/latency tradeoffs.

When Not to Rely on Testing Alone

  • For safety-critical deployments without human review gates.
  • When test datasets are too small or unrepresentative.

Using the Testing Feature

Use the CLI command crewai test to run repeated crew executions and compare outputs across iterations. It accepts two optional parameters: n_iterations (default 2) and model (default gpt-4o-mini).
crewai test
If you want to run more iterations or use a different model, you can specify the parameters like this:
crewai test --n_iterations 5 --model gpt-4o
or using the short forms:
crewai test -n 5 -m gpt-4o
When you run the crewai test command, the crew will be executed for the specified number of iterations, and the performance metrics will be displayed at the end of the run. A table of scores at the end will show the performance of the crew in terms of the following metrics:
| Tasks/Crew/Agents  | Run 1 | Run 2 | Avg. Total | Agents                           | Additional Info                |
|--------------------|-------|-------|------------|----------------------------------|--------------------------------|
| Task 1             | 9.0   | 9.5   | 9.2        | Professional Insights Researcher |                                |
| Task 2             | 9.0   | 10.0  | 9.5        | Company Profile Investigator     |                                |
| Task 3             | 9.0   | 9.0   | 9.0        | Automation Insights Specialist   |                                |
| Task 4             | 9.0   | 9.0   | 9.0        | Final Report Compiler            | Automation Insights Specialist |
| Crew               | 9.00  | 9.38  | 9.2        |                                  |                                |
| Execution Time (s) | 126   | 145   | 135        |                                  |                                |

The example above shows the test results for two runs of a crew with four tasks, with the average total score for each task and for the crew as a whole.
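The averages in the table are plain means over the per-run scores. As a minimal sketch (using the illustrative scores from the table above, not actual CLI output), the per-task and crew averages can be reproduced like this:

```python
# Illustrative run scores matching the example table above.
run_scores = {
    "Task 1": [9.0, 9.5],
    "Task 2": [9.0, 10.0],
    "Task 3": [9.0, 9.0],
    "Task 4": [9.0, 9.0],
}

# Average score per task across runs.
task_averages = {task: sum(s) / len(s) for task, s in run_scores.items()}

# Crew score per run is the mean of that run's task scores.
num_runs = 2
crew_per_run = [
    sum(scores[i] for scores in run_scores.values()) / len(run_scores)
    for i in range(num_runs)
]
crew_average = sum(crew_per_run) / len(crew_per_run)

for task, avg in task_averages.items():
    print(f"{task}: {avg:.1f}")
print(f"Crew: {crew_average:.1f}")
```

Rounded to one decimal place, these match the "Avg. Total" column in the table.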

Common Failure Modes

Scores fluctuate too much between runs

  • Cause: high sampling randomness or unstable prompts.
  • Fix: lower temperature and tighten output constraints.
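One way to detect this failure mode is to measure the spread of a task's scores across iterations. The sketch below flags tasks whose standard deviation exceeds a threshold; the 0.5 cutoff is an arbitrary example, not a crewAI default:

```python
from statistics import stdev

def is_unstable(scores: list[float], max_stdev: float = 0.5) -> bool:
    """Flag a task whose scores vary too much across runs.

    The 0.5 threshold is an illustrative choice; tune it to your
    own quality bar before acting on it.
    """
    return len(scores) >= 2 and stdev(scores) > max_stdev

print(is_unstable([9.0, 9.5, 9.2]))  # small spread -> False
print(is_unstable([6.0, 9.5, 7.8]))  # large spread -> True
```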

Good test scores but poor production quality

  • Cause: test prompts do not match real workload.
  • Fix: build a representative test set from real production inputs.
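A simple way to build such a set is to draw a reproducible random sample from logged production inputs. This is a sketch, not a crewAI API: the input structure and field names here are hypothetical, and the fixed seed only keeps the sample stable between runs.

```python
import random

def build_test_set(production_inputs: list[dict], size: int, seed: int = 42) -> list[dict]:
    """Draw a reproducible random sample of real inputs to use as test cases.

    `production_inputs` is a hypothetical list of logged crew inputs;
    the fixed seed makes the same sample come back on every call.
    """
    rng = random.Random(seed)
    size = min(size, len(production_inputs))
    return rng.sample(production_inputs, size)

# Hypothetical logged inputs.
logged = [{"topic": f"company-{i}"} for i in range(100)]
test_set = build_test_set(logged, size=10)
print(len(test_set))  # 10
```

Because the seed is fixed, re-running the sampler yields the same test set, so score changes between test runs reflect crew changes rather than input changes.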