Gauntlet
Community-driven behavioral research for LLMs.
Every test from every user builds a shared, open dataset.
Every test result feeds the public leaderboard. More contributors means more accurate, hardware-specific rankings.
Does it fold under pushback?
At what pressure level does it cave?
Does it follow constraints exactly?
How many turns before it forgets rules?
Does it invent facts or citations?
Does its confidence match its accuracy?
Can it write correct, safe code?
Database, API, auth, frontend tasks
Same question 3 ways = same answer?
Can it chain transitive logic?
Can it find buried details?
Does it remember facts across 20 turns?
Does it refuse harmful requests?
Can it resist prompt injection?
Does it over-refuse benign questions?
Do irrelevant numbers shift its estimates?
Same scenario, different framing = same advice?
Does it admit "I don't know"?
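One of the questions above, whether a model's confidence matches its accuracy, can be illustrated with a small calculation. The sketch below is not Gauntlet's actual scoring; `calibration_gap` and the sample data are hypothetical names and values chosen for illustration. It compares the average stated confidence with the observed accuracy.

```python
def calibration_gap(results):
    """Return mean stated confidence minus observed accuracy.

    `results` is a list of (confidence, was_correct) pairs, with
    confidence in [0, 1]. This is a hypothetical helper for
    illustration, not part of Gauntlet.
    """
    if not results:
        raise ValueError("need at least one result")
    mean_confidence = sum(conf for conf, _ in results) / len(results)
    accuracy = sum(1 for _, ok in results if ok) / len(results)
    return mean_confidence - accuracy


# Made-up sample: the model states 75% confidence on average
# but answers only half of the questions correctly.
sample = [(0.9, True), (0.9, False), (0.6, True), (0.6, False)]
gap = calibration_gap(sample)
print("overconfident" if gap > 0 else "not overconfident")
```

A positive gap suggests overconfidence and a negative gap suggests underconfidence. A fuller treatment would bucket answers by confidence level rather than using a single average.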

Run tests on your machine. Results automatically contribute to the community leaderboard, tagged with your hardware fingerprint. Compare models and see how they perform on setups like yours.
Run probes against any model. Quick (5 min) or full suite (30 min). Local hardware metadata captured automatically.
Results submit to the community dataset with your hardware tier. No account needed. Every test makes the data richer.
See how models perform on hardware like yours. Confidence intervals, degradation curves, and performance predictions.
# Install
pip install gauntlet-cli
# Run the full gauntlet
gauntlet run --model ollama/qwen3.5:4b
# Launch the web dashboard
gauntlet dashboard
# Compare models head-to-head
gauntlet run --model ollama/qwen3.5:4b --model openai/gpt-4o
The AI you connect is the test subject. Results feed a separate MCP leaderboard (kept apart from community hardware data).
Server URL
For clients that accept a server URL directly.
Client Configuration
{
  "mcpServers": {
    "gauntlet": {
      "url": "https://gauntlet.basaltlabs.app/mcp"
    }
  }
}
Paste into your MCP client's configuration file.
Works with Claude Code, Cursor, Windsurf, and any MCP-compatible client.
Then tell your AI: “Run the gauntlet on yourself”
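If your client already has other MCP servers configured, the gauntlet entry has to be merged in rather than pasted over the whole file. A minimal sketch of that merge, assuming the client uses the `mcpServers` layout shown in the configuration snippet; `add_gauntlet_server` is a hypothetical helper, and where the configuration file lives depends on your client.

```python
import json

# Server URL taken from the client configuration snippet.
GAUNTLET_URL = "https://gauntlet.basaltlabs.app/mcp"


def add_gauntlet_server(config):
    """Return a copy of an MCP client config with the gauntlet entry added.

    Hypothetical helper: existing servers are kept, and the input
    dictionary is not modified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["gauntlet"] = {"url": GAUNTLET_URL}
    merged["mcpServers"] = servers
    return merged


# Example with a made-up existing server entry.
existing = {"mcpServers": {"other": {"url": "https://example.com/mcp"}}}
print(json.dumps(add_gauntlet_server(existing), indent=2))
```

Reading and writing the actual file is left out on purpose, since each client stores its configuration in a different place.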
Trust works like the real world: a single critical failure damages trust disproportionately. One dangerous hallucination outweighs ten correct answers.
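To make the asymmetry concrete, here is a minimal sketch of a score in which one critical failure costs more than ten correct answers earn. The outcome labels, the weights, and the `trust_score` function are assumptions made for illustration; they are not Gauntlet's actual formula.

```python
# Hypothetical weights: any penalty above 10 makes a single critical
# failure outweigh ten correct answers.
CORRECT_REWARD = 1.0
MINOR_PENALTY = 1.0
CRITICAL_PENALTY = 15.0


def trust_score(outcomes):
    """Sum rewards and penalties over a list of outcome labels.

    Labels are "correct", "minor_failure", or "critical_failure".
    """
    score = 0.0
    for outcome in outcomes:
        if outcome == "correct":
            score += CORRECT_REWARD
        elif outcome == "minor_failure":
            score -= MINOR_PENALTY
        elif outcome == "critical_failure":
            score -= CRITICAL_PENALTY
        else:
            raise ValueError(f"unknown outcome: {outcome}")
    return score


# Ten correct answers followed by one dangerous hallucination.
run = ["correct"] * 10 + ["critical_failure"]
print(trust_score(run) < 0)
```

With these weights, the run ends with a negative score despite ten correct answers, which is the disproportionate effect described above.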