AWS Lambda Deployment
Deploy Fair Forge generators, runners, and metrics as AWS Lambda functions for serverless execution.

Available Lambda Functions
| Function | Purpose | Endpoint |
|---|---|---|
| BestOf | Tournament-style AI assistant comparison | POST /run |
| Generators | Generate test datasets from context | POST /run |
| Runners | Execute tests against AI systems | POST /run |
BestOf Metric Lambda
Run tournament-style comparisons between multiple AI assistants to determine which performs best.

How It Works
- Submit datasets from multiple assistants (same questions, different responses)
- The LLM judge evaluates head-to-head matchups
- Winners advance through elimination rounds
- Returns the tournament winner with detailed contest results
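The elimination flow above can be sketched as a plain single-elimination bracket; the `pick_winner` callable below is a stand-in for the LLM judge's head-to-head verdict, not the actual Fair Forge implementation:

```python
def tournament(contestants, pick_winner):
    """Single-elimination tournament: winners advance until one remains.

    pick_winner(a, b) stands in for the LLM judge's head-to-head decision.
    Returns (winner, total_rounds).
    """
    field = list(contestants)
    rounds = 0
    while len(field) > 1:
        rounds += 1
        nxt = [pick_winner(field[i], field[i + 1])
               for i in range(0, len(field) - 1, 2)]
        if len(field) % 2:          # odd contestant out gets a bye
            nxt.append(field[-1])
        field = nxt
    return field[0], rounds

# Four contestants with a judge that always favors the first argument:
winner, total_rounds = tournament(["a", "b", "c", "d"], lambda x, y: x)
# winner == "a", total_rounds == 2
```

With *n* contestants this takes ceil(log2(n)) rounds, which is what the `total_rounds` response field reports.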
Supported LLM Providers
| Provider | class_path |
|---|---|
| Groq | langchain_groq.chat_models.ChatGroq |
| OpenAI | langchain_openai.chat_models.ChatOpenAI |
| Google Gemini | langchain_google_genai.chat_models.ChatGoogleGenerativeAI |
| Ollama | langchain_ollama.chat_models.ChatOllama |
Request Format
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| criteria | str | "Overall response quality" | Evaluation criteria for judging |
| use_structured_output | bool | true | Use LangChain structured output |
| verbose | bool | false | Enable verbose logging |
Example: Compare Two Assistants
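A sketch of a request body for `POST /run`, built in Python for illustration. The `class_path` value and the `config` keys come from the tables above; the `datasets`, `kwargs`, and `items` keys are assumptions about the payload shape, not a published schema:

```python
import json

# Hypothetical BestOf request: two assistants answered the same question,
# and a Groq model acts as the judge.
payload = {
    "llm": {
        "class_path": "langchain_groq.chat_models.ChatGroq",   # judge model (from provider table)
        "kwargs": {"model": "llama-3.1-8b-instant"},           # provider args (assumed key)
    },
    "config": {
        "criteria": "Overall response quality",
        "use_structured_output": True,
        "verbose": False,
    },
    "datasets": [
        # One dataset per assistant: same questions, different responses (assumed shape).
        {"assistant_id": "assistant-a",
         "items": [{"query": "What is Fair Forge?", "response": "..."}]},
        {"assistant_id": "assistant-b",
         "items": [{"query": "What is Fair Forge?", "response": "..."}]},
    ],
}
body = json.dumps(payload)
```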
Response Format
Response Fields
| Field | Type | Description |
|---|---|---|
| winner | str | The assistant_id of the tournament winner |
| contestants | list | All assistant_ids that participated |
| total_rounds | int | Number of tournament rounds |
| contests | list | Details of each head-to-head matchup |
| contests[].confidence | float | Judge's confidence in decision (0-1) |
| contests[].verdict | str | Brief summary of the decision |
| contests[].reasoning | str | Detailed reasoning for the decision |
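The fields above can be consumed like any JSON response; the sample values below are invented for illustration, and the 0.7 confidence cutoff is an arbitrary choice:

```python
import json

# Hypothetical BestOf response, shaped after the response-fields table.
raw = json.dumps({
    "winner": "assistant-a",
    "contestants": ["assistant-a", "assistant-b"],
    "total_rounds": 1,
    "contests": [{
        "confidence": 0.85,
        "verdict": "assistant-a was clearer",
        "reasoning": "Both answered correctly, but assistant-a cited the source.",
    }],
})

result = json.loads(raw)
# Keep only high-confidence verdicts for human review (threshold is arbitrary).
confident = [c for c in result["contests"] if c["confidence"] >= 0.7]
print(result["winner"], len(confident))
```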
Generators Lambda
Generate synthetic test datasets from markdown content using any LLM.

Supported LLM Providers
| Provider | class_path |
|---|---|
| Groq | langchain_groq.chat_models.ChatGroq |
| OpenAI | langchain_openai.chat_models.ChatOpenAI |
| Google Gemini | langchain_google_genai.chat_models.ChatGoogleGenerativeAI |
| Ollama | langchain_ollama.chat_models.ChatOllama |
Request Format
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| assistant_id | str | Required | ID for generated dataset |
| num_queries | int | 3 | Questions per chunk |
| language | str | "english" | Language for generation |
| conversation_mode | bool | false | Generate conversations |
| max_chunk_size | int | 2000 | Max chars per chunk |
| min_chunk_size | int | 200 | Min chars per chunk |
| seed_examples | list[str] | null | Example questions for style |
Example: Using Groq
Example: Using OpenAI
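A sketch covering both examples: the `class_path` values come from the provider table and the `config` keys from the parameters table, but the overall payload shape (`llm`, `kwargs`, `content`) and the model names are assumptions for illustration:

```python
# Hypothetical generator requests for the Groq and OpenAI examples.
def make_request(class_path, model, markdown):
    """Build a POST /run payload for the Generators Lambda (assumed shape)."""
    return {
        "llm": {"class_path": class_path, "kwargs": {"model": model}},
        "config": {
            "assistant_id": "docs-bot",   # required: ID for the generated dataset
            "num_queries": 3,             # questions per chunk
            "language": "english",
        },
        "content": markdown,              # markdown source to chunk (assumed key)
    }

groq_req = make_request(
    "langchain_groq.chat_models.ChatGroq", "llama-3.1-8b-instant", "# My docs\n...")
openai_req = make_request(
    "langchain_openai.chat_models.ChatOpenAI", "gpt-4o-mini", "# My docs\n...")
```

Switching providers is just a matter of swapping `class_path` and the provider-specific model arguments.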
Response Format
Runners Lambda
Execute test datasets against AI systems.

Modes
- LLM Mode: Direct execution against any LangChain-compatible LLM
- Alquimia Mode: Execution against Alquimia AI agents

LLM Mode Request
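A sketch of an LLM-mode payload: the `class_path` value matches the LangChain convention used throughout these docs, while the `mode`, `kwargs`, and `dataset` keys are assumptions about the request shape:

```python
# Hypothetical LLM-mode request: the runner executes each dataset query
# directly against the LangChain model named by class_path.
llm_mode = {
    "mode": "llm",
    "llm": {
        "class_path": "langchain_openai.chat_models.ChatOpenAI",
        "kwargs": {"model": "gpt-4o-mini"},
    },
    "dataset": {
        "assistant_id": "docs-bot",
        "items": [{"query": "What is Fair Forge?"}],
    },
}
```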
Alquimia Mode Request
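In Alquimia mode the runner targets an Alquimia AI agent instead of a LangChain model. Every key in this sketch is an assumption made for illustration; consult the Runners reference for the real field names:

```python
# Hypothetical Alquimia-mode request: the agent endpoint and identifier
# replace the LangChain class_path used in LLM mode.
alquimia_mode = {
    "mode": "alquimia",
    "agent": {
        "url": "https://alquimia.example/agent",  # placeholder endpoint
        "agent_id": "my-agent",                   # placeholder identifier
    },
    "dataset": {
        "assistant_id": "docs-bot",
        "items": [{"query": "What is Fair Forge?"}],
    },
}
```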
Example: LLM Mode
Response Format
Deployment
Prerequisites
- AWS CLI configured
- Docker installed
- AWS ECR repository access
Deploy BestOf Metric
Deploy Generators
Deploy Runners
View Logs
Architecture
Integration Example
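A sketch of chaining the two Lambdas: generate a dataset, then hand it to the runner. The `post` callable is caller-supplied (e.g. a thin wrapper around an HTTP client), and all payload keys are assumptions about the request shapes, not a published schema:

```python
def generate_then_run(post, generators_url, runners_url, markdown):
    """Chain the Generators and Runners Lambdas.

    post(url, payload) performs the HTTP POST and returns the parsed JSON
    body; payload shapes here are assumed, not a published schema.
    """
    dataset = post(generators_url, {
        "llm": {"class_path": "langchain_groq.chat_models.ChatGroq",
                "kwargs": {"model": "llama-3.1-8b-instant"}},
        "config": {"assistant_id": "docs-bot", "num_queries": 3},
        "content": markdown,
    })
    return post(runners_url, {
        "mode": "llm",
        "llm": {"class_path": "langchain_openai.chat_models.ChatOpenAI",
                "kwargs": {"model": "gpt-4o-mini"}},
        "dataset": dataset,
    })

# Exercised with a stub transport instead of live endpoints:
calls = []
def fake_post(url, payload):
    calls.append(url)
    return {"assistant_id": "docs-bot", "items": []}

result = generate_then_run(fake_post, "https://gen/run", "https://run/run", "# Docs")
# calls == ["https://gen/run", "https://run/run"]
```

Injecting the transport keeps the chaining logic testable without deployed endpoints.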
Combine generators and runners: generate a test dataset from your content, then execute it against the target AI system.

Next Steps
BestOf Metric
Learn about tournament-style evaluation
Generators
Learn about test generation
Runners
Learn about test execution