Custom Code Evaluator

The default evaluators in Agenta may not cover every use case. When they fall short, you can write a custom evaluator tailored to your needs. Custom evaluators are written in Python, JavaScript, or TypeScript.

Evaluation code

Your custom evaluator should include a function called evaluate with the following signature:

from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    ...  # your scoring logic goes here

This function should return a float value representing the evaluation score. The score ranges from 0.0 to 1.0, where 0.0 indicates a failed evaluation and 1.0 indicates a perfect score.

The function parameters are:

app_params: A dictionary containing the configuration of the app. This includes the prompt, the model, and all other parameters specified in the playground, under the same names.
inputs: A dictionary containing the inputs passed to the app.
output: The output generated by the app.
correct_answer: The expected answer, against which the output is evaluated.
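
To make these parameters concrete, the sketch below calls an evaluator with sample arguments. All keys and values are hypothetical and chosen for illustration only; the actual contents of app_params and inputs depend on your playground configuration and test set. The scoring rule (a case-insensitive containment check) is likewise just an assumption to keep the example short.

```python
from typing import Dict


def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    # Illustrative rule: full credit if the expected answer appears
    # anywhere in the output, ignoring case.
    return 1.0 if correct_answer.lower() in output.lower() else 0.0


# Hypothetical argument values (not taken from a real app).
app_params = {"prompt": "What is the capital of {country}?", "model": "example-model"}
inputs = {"country": "France"}
output = "The capital of France is Paris."
correct_answer = "Paris"

score = evaluate(app_params, inputs, output, correct_answer)
print(score)  # 1.0
```

Note that this evaluator never reads app_params or inputs. They are still part of the signature, so the function must accept them even when the scoring logic ignores them.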

Here's an example implementation of an exact match evaluator:

from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    # Return floats so the result matches the declared return type.
    return 1.0 if output == correct_answer else 0.0
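
Since scores may fall anywhere between 0.0 and 1.0, an evaluator can also award partial credit. The following is a minimal sketch, assuming that character-level similarity is a meaningful measure for your outputs; it uses SequenceMatcher from Python's standard difflib module, whose ratio() method returns a float in the range 0.0 to 1.0. The normalization step (trimming whitespace and lowercasing) is an illustrative choice, not something Agenta requires.

```python
from difflib import SequenceMatcher
from typing import Dict


def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    # Normalize both strings so casing and surrounding whitespace
    # do not affect the score.
    generated = output.strip().lower()
    expected = correct_answer.strip().lower()
    # ratio() is 1.0 for identical strings and 0.0 when the strings
    # share no matching characters.
    return SequenceMatcher(None, generated, expected).ratio()
```

With this version, "Paris" and " paris " score 1.0, while a near miss such as a misspelled answer receives a score between 0.0 and 1.0 rather than failing outright.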
