Advanced Evals for Tool-Calling Responses

Until now, Evals let you test only whether your agent called the correct tool. Today, we’re taking it a step further: you can now evaluate the agent’s actual response after the tool has executed, so you can confirm that the final outcome for the user is what you intended.
With these new evaluation capabilities, you can:
- Verify the final agent response: Move beyond testing just the tool trigger. Ensure your agents correctly interpret the tool’s output and communicate it back to the user with accuracy.
- Define custom success criteria: Set precise rules for what makes a response successful, such as requiring that a booking confirmation number is provided.
- Use success and failure examples: Provide sample responses to guide the evaluation engine, helping it understand the nuances of your agent’s desired behavior.
- Test end-to-end reliability: Build more robust agents by testing the entire tool-calling lifecycle, from trigger to final confirmation.
To start evaluating tool responses, navigate to your Evals, select a test case with a tool call, and toggle on Evaluate Agent Response.
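
To make the idea of a success criterion with success and failure examples more concrete, here is a minimal Python sketch. It is an illustration only, not the product's configuration format or evaluation engine: the `ResponseEval` structure, its field names, and the confirmation-number format (two capital letters followed by six digits) are all assumptions, and a simple regular expression stands in for whatever the evaluation engine does.

```python
import re
from dataclasses import dataclass, field


# Hypothetical structure: the field names are illustrative and are not
# taken from the product's actual configuration format.
@dataclass
class ResponseEval:
    criterion: str
    success_examples: list[str] = field(default_factory=list)
    failure_examples: list[str] = field(default_factory=list)


# Assumed confirmation-number format: two capital letters, then six digits.
CONFIRMATION_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")


def has_confirmation_number(response: str) -> bool:
    """Stand-in check: a regex, not the evaluation engine itself."""
    return CONFIRMATION_RE.search(response) is not None


def examples_agree(ev: ResponseEval) -> bool:
    """Check that every success example passes and every failure example fails."""
    return all(has_confirmation_number(s) for s in ev.success_examples) and not any(
        has_confirmation_number(f) for f in ev.failure_examples
    )


booking_eval = ResponseEval(
    criterion="The response provides a booking confirmation number.",
    success_examples=["Your table is booked. Confirmation number: AB123456."],
    failure_examples=["Your table is booked. See you soon!"],
)
```

Running `examples_agree(booking_eval)` is a quick way to confirm that your sample responses actually match the rule you wrote down before you rely on them to guide an evaluation.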

