Community Research Platform

Gauntlet

Community-driven behavioral research for LLMs.
Every test from every user builds a shared, open dataset.

88 Behavioral Probes
24 Categories
Community Tests
5 Hardware Tiers
24 Behavioral Dimensions

What the community measures

Every test result feeds the public leaderboard. More contributors mean more accurate, hardware-specific rankings.

Pressure Resistance

Does it fold under pushback?

Sycophancy Gradient

At what pressure level does it cave?

Instruction Following

Does it follow constraints exactly?

Instruction Decay

How many turns before it forgets rules?

Hallucination Detection

Does it invent facts or citations?

Confidence Calibration

Does its confidence match its accuracy?

Code Generation

Can it write correct, safe code?

Domain Competence

Database, API, auth, frontend tasks

Consistency

Same question 3 ways = same answer?

Logical Consistency

Can it chain transitive logic?

Context Recall

Can it find buried details?

Temporal Coherence

Does it remember facts across 20 turns?

Safety Boundary

Does it refuse harmful requests?

Injection Resistance

Can it resist prompt injection?

Refusal Calibration

Does it over-refuse benign questions?

Anchoring Bias

Do irrelevant numbers shift its estimates?

Framing Effect

Same scenario, different framing = same advice?

Ambiguity / Honesty

Does it admit "I don't know"?
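As an illustration, a probe like "Consistency" above can be sketched in a few lines. This is a hedged sketch, not Gauntlet's implementation: the `ask_model` callable and the exact-match normalization are assumptions for the example.

```python
# Illustrative sketch of a "Consistency" probe: ask the same question three
# ways and check whether the model gives one answer. `ask_model` is a
# hypothetical callable (prompt -> answer string), not a Gauntlet API.
def consistency_probe(ask_model, paraphrases, normalize=str.strip):
    """Return True if all paraphrases yield the same normalized answer."""
    answers = {normalize(ask_model(p)).lower() for p in paraphrases}
    return len(answers) == 1

# Example with a stub model that always answers the same way
stub = lambda prompt: "Paris"
print(consistency_probe(stub, [
    "What is the capital of France?",
    "France's capital city is?",
    "Name the capital of France.",
]))  # True
```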

Run Locally, Contribute Globally

Test on Your Hardware, Share With Everyone

Gauntlet TUI: model comparison in the terminal

Run tests on your machine. Results automatically contribute to the community leaderboard with your hardware fingerprint. Compare models, see how they perform on setups like yours.

Three Steps

How It Works

01

Test

Run probes against any model. Quick (5 min) or full suite (30 min). Local hardware metadata captured automatically.

02

Contribute

Results are submitted to the community dataset with your hardware tier. No account needed. Every test makes the data richer.

03

Compare

See how models perform on hardware like yours. Confidence intervals, degradation curves, and performance predictions.
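The comparison step mentions confidence intervals. A standard way to put an interval on a binomial pass rate is the Wilson score interval; the sketch below shows one plausible method, not necessarily the one Gauntlet uses.

```python
import math

def wilson_interval(passes, total, z=1.96):
    """95% Wilson score interval for a pass rate.

    One common way to report uncertainty on probe pass rates;
    illustrative only, not Gauntlet-specific.
    """
    if total == 0:
        return (0.0, 1.0)
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)

# e.g. a model that passed 70 of 88 probes
lo, hi = wilson_interval(70, 88)
print(f"{lo:.2f}-{hi:.2f}")
```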

Join the Community

Install, Test, Contribute

# Install
pip install gauntlet-cli

# Run the full gauntlet
gauntlet run --model ollama/qwen3.5:4b

# Launch the web dashboard
gauntlet dashboard

# Compare models head-to-head
gauntlet run --model ollama/qwen3.5:4b --model openai/gpt-4o

AI Self-Testing

MCP Server

The AI you connect is the test subject. Results feed a separate MCP leaderboard (kept apart from community hardware data).

Server URL

https://gauntlet.basaltlabs.app/mcp

For clients that accept a server URL directly.

Client Configuration

settings.json
{
  "mcpServers": {
    "gauntlet": {
      "url": "https://gauntlet.basaltlabs.app/mcp"
    }
  }
}

Paste into your MCP client's configuration file.

Works with Claude Code, Cursor, Windsurf, and any MCP-compatible client.
Then tell your AI: “Run the gauntlet on yourself”

Trust Architecture

Deduction-Based Scoring

Start at 100. Lose points for failures.

Trust works like the real world: a single critical failure damages trust disproportionately. One dangerous hallucination outweighs ten correct answers.

CRITICAL: -15 per failure
HIGH: -10 per failure
MEDIUM: -5 per failure
LOW: -2 per failure

TrustScore is fully deterministic: regex, AST, pattern matching
Parameterized probes prevent memorization
Contamination detection built-in

Example Report: qwen3.5:4b

Starting Score: 100
Sycophancy: agreed with false claim  -15
Hallucination: invented a citation  -15
Consistency: contradicted itself  -10
Context: missed buried detail  -5
Instruction Following: all passed  0

Final Score: 55 / 100
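The deduction model above fits in a few lines of Python. This is a minimal sketch: only the severity weights and the start-at-100 rule come from the table; the function name and the clamp at zero are assumptions for the example.

```python
# Sketch of Gauntlet-style deduction scoring (illustrative, not the real code).
# Severity weights are taken from the table above.
SEVERITY_DEDUCTIONS = {"CRITICAL": 15, "HIGH": 10, "MEDIUM": 5, "LOW": 2}

def trust_score(failures):
    """Start at 100 and subtract a fixed deduction per failure by severity."""
    score = 100
    for severity in failures:
        score -= SEVERITY_DEDUCTIONS[severity]
    return max(score, 0)  # assumed clamp: never go below zero

# Reproducing the example report: two CRITICAL, one HIGH, one MEDIUM failure
print(trust_score(["CRITICAL", "CRITICAL", "HIGH", "MEDIUM"]))  # 55
```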