Analyze, measure, and improve AI agent conversations with structured evaluation workflows and hierarchical insights.

TurnWise organizes conversations into a hierarchical structure, enabling evaluation at any level of granularity, as sketched below.

Conversation: A complete multi-turn dialogue between users and AI agents. Evaluate the entire conversation flow and overall quality.
Message: Individual messages within a conversation (user, assistant, system, tool). Analyze each exchange independently.
Step: Individual reasoning steps within a message (thinking, tool calls, outputs). Dive deep into the agent's thought process.
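A minimal sketch of how this hierarchy could be represented; the class and field names below are illustrative Python, not TurnWise's actual data model.

```python
from dataclasses import dataclass, field
from typing import Literal

# Illustrative shapes only; the field names are assumptions, not TurnWise's schema.

@dataclass
class Step:
    """A single reasoning step inside a message (thinking, tool call, or tool output)."""
    kind: Literal["thinking", "tool_call", "tool_output"]
    content: str

@dataclass
class Message:
    """One turn in the conversation, optionally broken down into steps."""
    role: Literal["user", "assistant", "system", "tool"]
    content: str
    steps: list[Step] = field(default_factory=list)

@dataclass
class Conversation:
    """A complete multi-turn dialogue; the unit for conversation-level evaluation."""
    id: str
    messages: list[Message] = field(default_factory=list)
```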
Everything you need to evaluate and improve your LLM conversations.
Evaluate entire conversations, individual messages, or specific reasoning steps. Create custom evaluation metrics with prompts and output schemas. Run evaluations on demand or in batch.
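For illustration, a custom metric could pair an evaluation prompt with a structured output schema along these lines (the field names and schema layout here are assumptions, not TurnWise's actual format):

```python
# Hypothetical metric definition: a prompt plus a JSON-Schema-style output contract.
helpfulness_metric = {
    "name": "response_helpfulness",
    "level": "message",  # could also be "conversation" or "step"
    "prompt": (
        "Rate how completely the assistant's reply addresses the user's request. "
        "Return a score from 1 to 5 and a short justification."
    ),
    "output_schema": {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 1, "maximum": 5},
            "justification": {"type": "string"},
        },
        "required": ["score", "justification"],
    },
}
```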
Automatically maintain compressed summaries of long conversations, preventing context-window overflow when evaluating lengthy dialogues. Summaries are updated incrementally as conversations grow.
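A rough sketch of how incremental summarization like this can work; `summarize` is a placeholder for whatever LLM call actually produces the compressed summary:

```python
def update_summary(summary: str, new_messages: list[str], summarize) -> str:
    """Fold only the new messages into the existing summary, so the evaluator
    never needs to re-read the full conversation."""
    if not new_messages:
        return summary
    joined = "\n".join(new_messages)
    prompt = (
        f"Current summary:\n{summary}\n\n"
        f"New messages:\n{joined}\n\n"
        "Rewrite the summary to incorporate the new messages, in under 200 words."
    )
    return summarize(prompt)  # placeholder for an LLM call
```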
Define reusable evaluation workflows (pipelines). Each pipeline contains multiple evaluation nodes (metrics). Execute pipelines across datasets with streaming results.
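Conceptually, a pipeline is an ordered set of metric nodes applied over a dataset with streamed results. The sketch below uses plain Python and invented names (`Metric`, `run_pipeline`), not the real TurnWise API:

```python
from typing import Callable, Iterator

# A metric node: takes one conversation record, returns one evaluation result.
Metric = Callable[[dict], dict]

def run_pipeline(metrics: list[Metric], dataset: list[dict]) -> Iterator[dict]:
    """Apply every metric node to every conversation in the dataset, yielding
    each result as soon as it is ready so callers can stream instead of waiting."""
    for conversation in dataset:
        for metric in metrics:
            yield {"conversation_id": conversation["id"], **metric(conversation)}
```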
Organize conversations into datasets. Track LLM calls, costs, and performance metrics. Store structured outputs and metadata for comprehensive analysis.
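The kind of per-call record that could back this tracking, with illustrative field names and values:

```python
# Illustrative record for one tracked LLM call; every field name and value is an assumption.
llm_call_record = {
    "conversation_id": "conv_0042",
    "model": "example-judge-model",   # hypothetical model name
    "prompt_tokens": 1850,
    "completion_tokens": 220,
    "cost_usd": 0.0034,
    "latency_ms": 910,
    "structured_output": {"score": 4, "justification": "Answered the question fully."},
}
```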
Get started in three simple steps, sketched end to end below.
1. Import your multi-turn conversation data with messages and steps. Organize them into datasets for easy management.
2. Create custom evaluation metrics with prompts and output schemas. Build reusable evaluation pipelines.
3. Run evaluations and see results streaming in real time. Get insights at the conversation, message, or step level.
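Put together, the three steps might look roughly like the sketch below; every identifier in it (`conversations.jsonl`, `evaluate`, and so on) is a stand-in rather than TurnWise's real interface:

```python
import json

# 1. Import conversations: here, one JSON object (with an "id" and "messages") per line.
with open("conversations.jsonl") as f:
    dataset = [json.loads(line) for line in f]

# 2. Define a metric: an evaluation prompt plus the expected output schema.
intent_metric = {
    "name": "intent_alignment",
    "prompt": "Does the final answer address the user's original request? Answer yes or no.",
    "output_schema": {"type": "object", "properties": {"aligned": {"type": "boolean"}}},
}

# 3. Run the evaluation and consume results as they come in.
def evaluate(metric: dict, conversation: dict) -> dict:
    # Stand-in: a real run would send the prompt and conversation to an LLM judge.
    return {"metric": metric["name"], "aligned": True}

for conversation in dataset:
    print(conversation.get("id"), evaluate(intent_metric, conversation))
```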
TurnWise includes a powerful set of evaluation metrics designed specifically for multi-turn LLM agent conversations. Evaluate at the message, step, or conversation level; an example metric result is sketched after the list below.
Conversation Continuity Metric - Detects when users re-ask similar questions, indicating the previous response was incomplete or unsatisfactory.
Response Dissatisfaction Metric - Identifies explicit user corrections or expressions of dissatisfaction with the assistant's response.
Tool Selection Error - Evaluates whether the correct tool was selected for the given task context.
Parameter Hallucination - Detects hallucinated or fabricated parameters passed to tools (e.g., invented file paths, non-existent IDs).
Self-Correction Detection - Measures the agent's ability to recognize and recover from its own errors.
Tool Use Metrics - Comprehensive multi-dimensional analysis of tool usage including selection, parameter accuracy, and result handling.
Tool Chain Inefficiency - Identifies redundant, circular, or inefficient sequences of tool calls.
Agent Trajectory Analysis - Analyzes conversation patterns for circular reasoning, regression, stalls, and goal drift.
Intent Drift Metric - Measures how well the agent maintains alignment with the original user intent throughout the conversation.
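To make the output of these metrics concrete, a Parameter Hallucination result might look something like this (the exact fields are an assumption for illustration):

```python
# Hypothetical result for the Parameter Hallucination metric on one tool-call step.
parameter_hallucination_result = {
    "metric": "parameter_hallucination",
    "step_id": "step_17",
    "hallucinated": True,
    "hallucinated_parameters": ["file_path"],  # e.g. a path never mentioned earlier
    "explanation": "The agent passed /tmp/report_final.csv, which no prior message or tool output references.",
}
```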
TurnWise provides two variants of each metric. Metric prompts can reference template variables such as @HISTORY, {goal}, and {tools} for context-aware evaluation, and metrics can output different types of results.
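Such template variables would typically be substituted into the metric prompt before evaluation. The substitution mechanics below are an assumption; only the variable names come from the feature description:

```python
# Rough illustration of filling metric-prompt template variables before evaluation.
prompt_template = (
    "Conversation so far:\n@HISTORY\n\n"
    "The user's goal: {goal}\n"
    "Tools available to the agent: {tools}\n\n"
    "Did the agent's latest step move toward the goal? Answer yes or no, with a reason."
)

filled_prompt = (
    prompt_template
    .replace("@HISTORY", "user: Book me a flight to Oslo.\nassistant: Searching flights...")
    .replace("{goal}", "book a flight to Oslo")
    .replace("{tools}", "search_flights, book_flight")
)
```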
Join TurnWise and start evaluating your multi-turn LLM conversations with powerful, hierarchical insights.