{% extends "base.html" %} {% from "components/custom_dropdown.html" import render_dropdown %} {% set active_page = 'benchmark' %} {% block title %}Benchmark Configuration - Deep Research System{% endblock %} {% block extra_head %} {% endblock %} {% block content %}

Benchmark Guidelines

Purpose: Benchmarks are designed to help you evaluate whether your configuration works well; they are not intended for research papers or production use.

Responsible Usage: Please use reasonable example counts to avoid overwhelming search engines. The default of 75 examples provides a good balance for configuration testing.

Requirements: Benchmarks require an evaluation model for grading results. You can configure your preferred provider and model in the Evaluation Settings below. The default uses OpenRouter with Claude 3.7 Sonnet, but you can choose from various providers including OpenAI, Anthropic, or local models.
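{#
  Illustrative sketch only (not the confirmed backend contract): the Evaluation Settings
  section below reads its defaults from an `eval_settings` context object supplied by the
  route. Only `eval_settings.evaluation_model` is actually referenced in this template;
  the other field names here are assumptions. The OpenRouter + Claude 3.7 Sonnet default
  described above could then be expressed roughly as:

  {% set eval_settings = {
      'evaluation_provider': 'openrouter',
      'evaluation_model': 'anthropic/claude-3.7-sonnet',
      'evaluation_endpoint': 'https://openrouter.ai/api/v1',
      'evaluation_temperature': 0
  } %}
#}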

Search Engine Recommendations

  • Tavily: Recommended for general knowledge benchmarks - AI-optimized search API, reliable results
  • Brave: Independent search engine, though performance tends to be lower for unclear reasons - possibly a smaller index, a different ranking algorithm, or API limitations
  • SearXNG: Often outperforms commercial APIs by aggregating multiple sources - shared resource, use moderate example counts
  • Specialized engines (ArXiv, PubMed, Wikipedia): Shared resources that are not suited to general SimpleQA questions - do not use them for this test

🔧 For Shared Resources: When using SearXNG or other shared engines, reduce iterations and questions per iteration in Settings to minimize load on shared infrastructure.

Quick Check
Test your config with reasonable limits
🎯 Configuration Testing
Give your benchmark run a descriptive name
Dataset Selection

SimpleQA

Fact-based questions with clear answers

Recommended: 50 examples provides a good balance for configuration testing

BrowseComp

Complex browsing and comparison tasks

Poor Performance Warning: We currently achieve close to 0% accuracy on BrowseComp.
For curiosity testing only: limited to a maximum of 20 examples due to poor performance, so you can see what this benchmark involves.
Current Configuration

Active Database Settings

The benchmark will use all settings from your database configuration

Provider
Loading...
Model
Loading...
Search Tool
Loading...
Iterations
Loading...
Questions/Iter
Loading...
Strategy
Loading...
To change any settings, go to Settings Dashboard
Evaluation Model Settings

Benchmark Evaluation Configuration

Configure the model used to grade benchmark results

Provider for the evaluation model
{{ render_dropdown( input_id="evaluation_model", dropdown_id="evaluation-model-dropdown", placeholder="Enter or select evaluation model", label="Evaluation Model", help_text="Model to grade benchmark results", allow_custom=true, show_refresh=true, refresh_aria_label="Refresh evaluation model list", data_initial_value=eval_settings.evaluation_model ) }}
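{#
  Sketch (assumption, kept as a comment so it does not render): the "Provider for the
  evaluation model" field above could be built with the same render_dropdown macro.
  The ids, label, and the eval_settings.evaluation_provider value are illustrative and
  not confirmed by the backend:

  {{ render_dropdown(
      input_id="evaluation_provider",
      dropdown_id="evaluation-provider-dropdown",
      placeholder="Enter or select evaluation provider",
      label="Evaluation Provider",
      help_text="Provider for the evaluation model",
      allow_custom=true,
      show_refresh=false,
      data_initial_value=eval_settings.evaluation_provider
  ) }}
#}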
API endpoint for evaluation model
0 recommended for consistent evaluation
Evaluation Model Selection: For accurate benchmark grading, use flagship models from major providers, such as the Claude Sonnet series or GPT-4-class models. Local models and smaller cloud models may produce inconsistent evaluations, which affects benchmark accuracy scores. That said, preliminary tests suggest local models may be adequate when the highest grading standards are not required.
50
Total Examples
Estimated time: 40-60 minutes
Current Benchmark:
0%
Status:
Initializing
Current Task:
Starting benchmark...
--%
Overall Accuracy
--
--
Est. Time Left
--
0
Completed
--
Avg Time/Example
SimpleQA: --% BrowseComp: --%

Current Question

No question being processed...
-- --
Waiting for benchmark to start...

All Results

No results yet...
{% endblock %} {% block page_scripts %} {% endblock %}