Dataset & Batch
Fair Forge uses two primary data structures to represent conversation data: Dataset and Batch.
Dataset
A Dataset represents a complete conversation session with an AI assistant.
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | str | Yes | Unique identifier for the conversation session |
| assistant_id | str | Yes | Identifier of the assistant being evaluated |
| language | str \| None | No | Language of the conversation (e.g., “english”, “spanish”) |
| context | str | Yes | System context/instructions provided to the assistant |
| conversation | list[Batch] | Yes | List of Q&A interactions in the conversation |
Example
Batch
A Batch represents a single question-answer interaction within a conversation.
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| qa_id | str | Yes | Unique identifier for this interaction |
| query | str | Yes | User’s question or message |
| assistant | str | Yes | Assistant’s response |
| ground_truth_assistant | str \| None | No | Expected/ideal response for comparison |
| observation | str \| None | No | Additional notes or observations |
| agentic | dict \| None | No | Metadata about the interaction |
| ground_truth_agentic | dict \| None | No | Expected metadata |
| logprobs | dict \| None | No | Log probabilities from the model |
| weight | float \| None | No | Relative importance when aggregating session-level scores |
Field Details
qa_id
A unique identifier for the interaction within the conversation. Use a consistent naming scheme:
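One possible scheme (illustrative, not mandated by Fair Forge) is a zero-padded sequence number, which keeps ids sortable and easy to trace back to a failing interaction:

```python
# Zero-padded sequential ids: qa_001, qa_002, ...
qa_ids = [f"qa_{i:03d}" for i in range(1, 4)]
print(qa_ids)  # ['qa_001', 'qa_002', 'qa_003']
```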
query
The user’s input message or question. This is what the assistant is responding to:
assistant
The assistant’s actual response. This is what gets evaluated:
ground_truth_assistant
The expected or ideal response, used by some metrics for comparison. This field is optional but recommended for metrics like Context and Conversational.
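To make the comparison concrete, here is a toy token-overlap score between a response and its ground truth. This is purely illustrative: Fair Forge's metrics use their own scoring, not this naive overlap.

```python
def token_overlap(response: str, ground_truth: str) -> float:
    """Fraction of ground-truth tokens that appear in the response.

    Illustrative only; not Fair Forge's actual comparison logic.
    """
    truth = set(ground_truth.lower().split())
    if not truth:
        return 0.0
    got = set(response.lower().split())
    return len(truth & got) / len(truth)

score = token_overlap(
    "Go to Settings and reset your password",  # assistant
    "Reset password in Settings",              # ground_truth_assistant
)
```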
observation
Additional context or notes about the interaction:
agentic
Metadata dictionary for storing additional information. Used by generators to store query metadata.
logprobs
Log probabilities from the model (if available):
weight
Optional relative importance of this interaction when metrics aggregate scores at session level (Conversational, Context, Regulatory). Controls how much each QA pair contributes to the session mean. If all weights are provided but do not sum to 1.0, a warning is emitted and equal weights are applied instead.
JSON Format
The data structures can be easily serialized to/from JSON:
Dataset JSON
Loading from JSON
Saving to JSON
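Both directions can be sketched with the stdlib json module, using plain dicts in the documented shape. With the real Pydantic models you would use Pydantic's own JSON helpers instead; the dict below mirrors the fields tables above.

```python
import json

# The documented Dataset shape as plain dicts (stdlib sketch; the real
# Pydantic models provide their own JSON (de)serialization).
dataset = {
    "session_id": "session_001",
    "assistant_id": "support_bot_v2",
    "language": "english",
    "context": "You are a helpful customer-support assistant.",
    "conversation": [
        {
            "qa_id": "qa_001",
            "query": "How do I reset my password?",
            "assistant": "Go to Settings > Security and click Reset password.",
            "ground_truth_assistant": None,
            "weight": None,
        }
    ],
}

text = json.dumps(dataset, indent=2)   # saving
restored = json.loads(text)            # loading
assert restored == dataset             # lossless round trip
```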
Pydantic Validation
Both Dataset and Batch are Pydantic models with built-in validation:
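The key consequence is that required fields are enforced at construction time. The stdlib sketch below shows the same idea with a dataclass, where a missing required field raises TypeError; with the real Pydantic models, a pydantic.ValidationError is raised instead.

```python
from dataclasses import dataclass

# Stdlib analogue of "required fields are enforced at construction".
@dataclass
class Batch:
    qa_id: str
    query: str
    assistant: str  # required: omitting it rejects the object

try:
    Batch(qa_id="qa_001", query="Hi")  # 'assistant' is missing
    constructed = True
except TypeError:
    constructed = False
```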
Usage with Metrics
Different metrics use different fields:
| Metric | Key Fields Used | Output Granularity |
|---|---|---|
| Toxicity | assistant, session_id, assistant_id | Global stream |
| Bias | query, assistant, context | Per session |
| Context | query, assistant, context, ground_truth_assistant, weight | Per session |
| Conversational | query, assistant, observation, ground_truth_assistant, weight | Per session |
| Regulatory | query, assistant, weight | Per session |
| Humanity | assistant, ground_truth_assistant | Per interaction |
| BestOf | query, assistant | Per block |
| Agentic | query, assistant, ground_truth_assistant, agentic, ground_truth_agentic | Per conversation |
Best Practices
Use Descriptive IDs
Choose meaningful qa_id values that help identify issues.
Include Ground Truth
When possible, include ground_truth_assistant for better evaluation.
Set Context Properly
The context field should contain the system instructions given to the assistant.
Use Metadata
Store useful metadata in agentic for analysis.
Next Steps
Statistical Modes
Learn about Frequentist vs Bayesian analysis
Metrics Overview
See how metrics use these structures