

Overview

CrewAI integrates with multiple LLM providers through provider-native SDKs, giving you the flexibility to choose the right model for your specific use case. This guide will help you understand how to configure and use different LLM providers in your CrewAI projects.

When to Use Advanced LLM Configuration

  • You need strict control of latency, cost, and output format.
  • You need model routing by task type.
  • You need reproducible, policy-sensitive behavior in production.

When Not to Over-Configure

  • You are in early prototyping with one simple task path.
  • You do not yet need structured outputs or model routing.

What are LLMs?

Large Language Models (LLMs) are the core intelligence behind CrewAI agents. They enable agents to understand context, make decisions, and generate human-like responses. Here’s what you need to know:

LLM Basics

Large Language Models are AI systems trained on vast amounts of text data. They power the intelligence of your CrewAI agents, enabling them to understand and generate human-like text.

Context Window

The context window determines how much text an LLM can process at once. Larger windows (e.g., 128K tokens) allow for more context but may be more expensive and slower.

Temperature

Temperature (0.0 to 1.0) controls response randomness. Lower values (e.g., 0.2) produce more focused, deterministic outputs, while higher values (e.g., 0.8) increase creativity and variability.
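For example, the same model can be configured for focused or creative output by varying only the temperature (model ID is illustrative):
from crewai import LLM

factual_llm = LLM(model="gpt-4o-mini", temperature=0.2)   # focused, repeatable answers
creative_llm = LLM(model="gpt-4o-mini", temperature=0.8)  # more varied, exploratory output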

Provider Selection

Each LLM provider (e.g., OpenAI, Anthropic, Google) offers different models with varying capabilities, pricing, and features. Choose based on your needs for accuracy, speed, and cost.

Setting up your LLM

You can specify the model in several places in your CrewAI code. Whichever model you choose, you also need to provide configuration (such as an API key) for its provider; see the provider configuration examples section for your provider.
The simplest way to get started is to set the model in your environment, either directly, through an .env file, or in your app code. If you used crewai create to bootstrap your project, it is already set.
.env
MODEL=model-id  # e.g. gpt-4o, gemini-2.0-flash, claude-3-sonnet-...

# Be sure to set your API keys here too. See the Provider
# section below.
Never commit API keys to version control. Use environment files (.env) or your system’s secret management.
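Alternatively, you can set the model in app code rather than through the MODEL variable; a minimal sketch (model ID is illustrative):
from crewai import LLM

# Equivalent to MODEL=gpt-4o-mini in .env; the API key is still read
# from the environment (e.g. OPENAI_API_KEY)
llm = LLM(model="gpt-4o-mini")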

Production LLM Patterns

The basics above show how to configure one model. In real systems, you usually combine several LLM patterns for cost, quality, and reliability.

Pattern 1: Route models by agent role

Use faster/cheaper models for extraction and heavier models for synthesis or critical decisions.
Code
from crewai import Agent, Crew, Process, Task

researcher = Agent(
    role="Researcher",
    goal="Collect factual inputs quickly",
    backstory="Fast information-gathering specialist",
    llm="openai/gpt-4o-mini",
)

reviewer = Agent(
    role="Reviewer",
    goal="Validate claims and produce final answer",
    backstory="Careful editor focused on correctness",
    llm="provider/model-id",
)

crew = Crew(
    agents=[researcher, reviewer],
    tasks=[
        Task(
            description="Find the latest policy changes and list the key points",
            expected_output="Bullet list of validated policy changes",
            agent=researcher,
        ),
        Task(
            description="Review findings and produce a final executive summary",
            expected_output="Concise, decision-ready summary",
            agent=reviewer,
        ),
    ],
    process=Process.sequential,
)

Pattern 2: Set reliability defaults once

Configure retry, timeout, and deterministic sampling in one reusable LLM object.
Code
from crewai import LLM

reliable_llm = LLM(
    model="openai/gpt-4o-mini",
    temperature=0.1,
    timeout=45,
    max_retries=3,
    max_tokens=1200,
    seed=7,
)
Use this for extraction, classification, and policy-sensitive tasks where variance should be low.

Pattern 3: Use structured outputs for machine-readable responses

For downstream automation, force JSON-shaped outputs rather than free-form prose.
Code
from crewai import LLM

json_llm = LLM(
    model="openai/gpt-4o",
    response_format={"type": "json"},
    temperature=0.0,
)
This reduces parser fragility in pipelines that feed APIs, databases, or workflow routers.

Pattern 4: Use OpenAI Responses API for multi-turn reasoning flows

When you need built-in tools, response chaining, or reasoning-model workflows, enable the Responses API explicitly.
Code
from crewai import LLM

reasoning_llm = LLM(
    model="openai/o4-mini",
    api="responses",
    auto_chain=True,
    store=True,
    reasoning_effort="medium",
)
This is especially useful in long-running assistants where you want conversation continuity and controllable reasoning depth.

Provider Configuration

For concept-level usage, keep provider setup minimal and explicit:
  1. Set provider credentials via environment variables.
  2. Pin model IDs explicitly in code or YAML.
  3. Set reliability defaults (timeout, max_retries, low temperature) for production.
A minimal sketch combining these steps is shown below; see the provider-specific documentation pages for deeper provider setup and runtime decisions.
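The following sketch puts the three steps together (provider, model ID, and values are illustrative):
import os
from crewai import LLM

# 1. Credentials come from environment variables, never from source control
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

# 2. Pin the model ID explicitly; 3. set reliability defaults for production
llm = LLM(
    model="openai/gpt-4o-mini",
    temperature=0.1,
    timeout=60,
    max_retries=3,
)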

Streaming Responses

CrewAI supports streaming responses from LLMs, allowing your application to receive and process outputs in real-time as they’re generated.
Enable streaming by setting the stream parameter to True when initializing your LLM:
from crewai import LLM

# Create an LLM with streaming enabled
llm = LLM(
    model="openai/gpt-4o",
    stream=True  # Enable streaming
)
When streaming is enabled, responses are delivered in chunks as they’re generated, creating a more responsive user experience.

Async LLM Calls

CrewAI supports asynchronous LLM calls for improved performance and concurrency in your AI workflows. Async calls allow you to run multiple LLM requests concurrently without blocking, making them ideal for high-throughput applications and parallel agent operations.
Use the acall method for asynchronous LLM requests:
import asyncio
from crewai import LLM

async def main():
    llm = LLM(model="openai/gpt-4o")

    # Single async call
    response = await llm.acall("What is the capital of France?")
    print(response)

asyncio.run(main())
The acall method supports all the same parameters as the synchronous call method, including messages, tools, and callbacks.
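Because acall is a coroutine, several requests can run concurrently with standard asyncio tooling; a minimal sketch (prompts are illustrative):
import asyncio
from crewai import LLM

async def main():
    llm = LLM(model="openai/gpt-4o")
    prompts = [
        "What is the capital of France?",
        "What is the capital of Japan?",
    ]
    # Issue both requests concurrently instead of awaiting them one at a time
    responses = await asyncio.gather(*(llm.acall(p) for p in prompts))
    for prompt, response in zip(prompts, responses):
        print(prompt, "->", response)

asyncio.run(main())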

Structured LLM Calls

CrewAI supports structured responses from LLM calls by allowing you to define a response_format using a Pydantic model. This enables the framework to automatically parse and validate the output, making it easier to integrate the response into your application without manual post-processing. For example, you can define a Pydantic model to represent the expected response structure and pass it as the response_format when instantiating the LLM. The model will then be used to convert the LLM output into a structured Python object.
Code
from crewai import LLM
from pydantic import BaseModel

class Dog(BaseModel):
    name: str
    age: int
    breed: str


llm = LLM(model="gpt-4o", response_format=Dog)

response = llm.call(
    "Analyze the following messages and return the name, age, and breed. "
    "Meet Kona! She is 3 years old and is a black german shepherd."
)
print(response)

# Output:
# Dog(name='Kona', age=3, breed='black german shepherd')

Advanced Features and Optimization

Learn how to get the most out of your LLM configuration.
CrewAI includes smart context management features:
from crewai import LLM

# CrewAI automatically handles:
# 1. Token counting and tracking
# 2. Content summarization when needed
# 3. Task splitting for large contexts

llm = LLM(
    model="gpt-4",
    max_tokens=4000,  # Limit response length
)
Best practices for context management:
  1. Choose models with appropriate context windows
  2. Pre-process long inputs when possible
  3. Use chunking for large documents (see the sketch after this list)
  4. Monitor token usage to optimize costs
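A minimal chunking sketch in plain Python (no CrewAI-specific chunking API is assumed); it splits a long document on paragraph boundaries and summarizes each piece separately to stay within the context window:
from crewai import LLM

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into roughly max_chars-sized pieces on paragraph boundaries."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)
    return chunks

long_document = "..."  # replace with your own long input text
llm = LLM(model="openai/gpt-4o-mini", max_tokens=1000)

# Summarize each chunk separately instead of sending the whole document at once
summaries = [
    llm.call(f"Summarize this section:\n\n{chunk}")
    for chunk in chunk_text(long_document)
]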

Token Usage Optimization

Choose the right context window for your task:
  • Small tasks (up to 4K tokens): Standard models
  • Medium tasks (between 4K-32K): Enhanced models
  • Large tasks (over 32K): Large context models
# Configure model with appropriate settings
llm = LLM(
    model="openai/gpt-4-turbo-preview",
    temperature=0.7,    # Adjust based on task
    max_tokens=4096,    # Set based on output needs
    timeout=300        # Longer timeout for complex tasks
)
  • Lower temperature (0.1 to 0.3) for factual responses
  • Higher temperature (0.7 to 0.9) for creative tasks

Best Practices

  1. Monitor token usage
  2. Implement rate limiting
  3. Use caching when possible
  4. Set appropriate max_tokens limits
Remember to regularly monitor your token usage and adjust your configuration as needed to optimize costs and performance.
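One way to apply rate limiting and token limits is per agent; a hedged sketch using the max_rpm agent setting together with an explicit max_tokens cap (values are illustrative):
from crewai import Agent, LLM

rate_limited_agent = Agent(
    role="Researcher",
    goal="Collect factual inputs quickly",
    backstory="Fast information-gathering specialist",
    llm=LLM(model="openai/gpt-4o-mini", max_tokens=1500),  # cap response length
    max_rpm=10,  # at most 10 requests per minute for this agent
)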
CrewAI internally uses native SDKs for LLM calls, which lets you drop additional parameters that your provider or use case does not need. This can help simplify your code and reduce the complexity of your LLM configuration. For example, if you don’t need to send the stop parameter, you can simply omit it from your LLM call:
from crewai import LLM
import os

os.environ["OPENAI_API_KEY"] = "<api-key>"

o3_llm = LLM(
    model="o3",
    drop_params=True,
    additional_drop_params=["stop"]
)
CrewAI provides message interceptors for several providers, allowing you to hook into request/response cycles at the transport layer. Supported providers:
  • ✅ OpenAI
  • ✅ Anthropic
Basic Usage:
import httpx
from crewai import LLM
from crewai.llms.hooks import BaseInterceptor

class CustomInterceptor(BaseInterceptor[httpx.Request, httpx.Response]):
    """Custom interceptor to modify requests and responses."""

    def on_outbound(self, request: httpx.Request) -> httpx.Request:
        """Print request before sending to the LLM provider."""
        print(request)
        return request

    def on_inbound(self, response: httpx.Response) -> httpx.Response:
        """Process response after receiving from the LLM provider."""
        print(f"Status: {response.status_code}")
        print(f"Response time: {response.elapsed}")
        return response

# Use the interceptor with an LLM
llm = LLM(
    model="openai/gpt-4o",
    interceptor=CustomInterceptor(),
)
Important Notes:
  • Both methods must return the received object (or an object of the same type).
  • Modifying received objects may result in unexpected behavior or application crashes.
  • Not all providers support interceptors; check the supported providers list above.
Interceptors operate at the transport layer. This is particularly useful for:
  • Message transformation and filtering
  • Debugging API interactions

Common Issues and Solutions

Most authentication issues can be resolved by checking API key format and environment variable names.
# OpenAI
OPENAI_API_KEY=sk-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
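A quick sanity check before running a crew can catch missing or misnamed keys early (the prefix checks are illustrative, based on the formats above):
import os

def check_key(name: str, prefix: str) -> None:
    """Fail fast if a required key is missing or does not look like the expected format."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    if not value.startswith(prefix):
        print(f"Warning: {name} does not start with '{prefix}' - double-check the key")

check_key("OPENAI_API_KEY", "sk-")
check_key("ANTHROPIC_API_KEY", "sk-ant-")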