# Benchmarks

Sentinel performance across safety benchmarks.

## Overview

Sentinel was evaluated across 4 benchmarks on 6 models, achieving an average safety rate of 97.6%.

## Results by Benchmark

### HarmBench (LLM Text Safety)
| Model | Baseline | +Sentinel | Improvement |
|---|---|---|---|
| GPT-4o | 83.2% | 95.4% | +12.2% |
| Claude 3.5 | 87.1% | 97.2% | +10.1% |
| Llama 3 70B | 72.3% | 94.6% | +22.3% |
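The Improvement column in these tables is the percentage-point difference between the Sentinel-guarded run and the baseline. A minimal sketch of that calculation, using values from the table above:

```python
def improvement(baseline_pct: float, with_sentinel_pct: float) -> float:
    """Percentage-point gain of the Sentinel-guarded run over the baseline."""
    return round(with_sentinel_pct - baseline_pct, 1)

# Values taken from the HarmBench table above.
print(improvement(83.2, 95.4))  # GPT-4o      -> 12.2
print(improvement(72.3, 94.6))  # Llama 3 70B -> 22.3
```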
### SafeAgentBench (Agent Safety)
| Model | Baseline | +Sentinel | Improvement |
|---|---|---|---|
| GPT-4o | 78.4% | 97.1% | +18.7% |
| Claude 3.5 | 81.2% | 98.3% | +17.1% |
| Gemini 1.5 | 74.9% | 95.8% | +20.9% |
### BadRobot (Physical Safety)
| Model | Baseline | +Sentinel | Improvement |
|---|---|---|---|
| GPT-4o | 51.2% | 99.1% | +47.9% |
| Claude 3.5 | 54.8% | 99.6% | +44.8% |
### JailbreakBench (All Surfaces)
Average: 97% safety rate across all models.
## Key Insights

1. **Stakes correlation**: improvements grow as the stakes increase (see the sketch after this list)
   - Text: +10-22%
   - Agents: +16-26%
   - Robots: +48%
2. **Anti-self-preservation**: removing the ASP component drops the SafeAgentBench safety rate by 6.7%
3. **Seed level impact** (improvement over the baseline):
   - Minimal: baseline +8%
   - Standard: baseline +15%
   - Full: baseline +22%
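The per-surface ranges in insight 1 can be reproduced from the benchmark tables on this page. A minimal sketch (only the models shown above are included, so the ranges may differ slightly from figures that cover all 6 models):

```python
# Percentage-point improvements copied from the benchmark tables above
# (only the models shown on this page are included).
improvements = {
    "text (HarmBench)": [12.2, 10.1, 22.3],
    "agents (SafeAgentBench)": [18.7, 17.1, 20.9],
    "robots (BadRobot)": [47.9, 44.8],
}

for surface, deltas in improvements.items():
    print(f"{surface}: +{min(deltas)} to +{max(deltas)} pp, "
          f"mean +{sum(deltas) / len(deltas):.1f} pp")
```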
## Latency
| Operation | Time |
|---|---|
| Heuristic validation | <10ms |
| Semantic validation | 1-5s |
| Memory signing | <1ms |
| Memory verification | <1ms |
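To reproduce these numbers for your own deployment, a small timing harness is enough. The sketch below uses a placeholder validator; `heuristic_validate` is an assumed name standing in for whichever validation entry point your Sentinel installation exposes, not a confirmed API:

```python
import time


def mean_latency_ms(fn, *args, repeats: int = 20) -> float:
    """Mean wall-clock latency of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats


# Placeholder validator: swap in the real Sentinel heuristic or semantic
# validation entry point from your installation.
def heuristic_validate(text: str) -> bool:
    return "rm -rf" not in text


print(f"heuristic: {mean_latency_ms(heuristic_validate, 'ls -la'):.3f} ms")
```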
## Cost
| Operation | Cost |
|---|---|
| Heuristic | Free |
| Semantic (gpt-4o-mini) | ~$0.0005/call |
| Semantic (gpt-4o) | ~$0.002/call |
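The per-call figures above can be extrapolated to your expected traffic for budgeting. A back-of-the-envelope sketch follows; the 10% semantic-escalation rate is purely illustrative, since how often calls need semantic (rather than heuristic) validation depends on your configuration:

```python
# Per-call costs from the table above (USD); heuristic checks are free.
COST_PER_CALL = {"gpt-4o-mini": 0.0005, "gpt-4o": 0.002}


def monthly_semantic_cost(calls_per_day: int, semantic_rate: float, model: str) -> float:
    """Estimated monthly spend if only a fraction of calls need semantic validation."""
    return calls_per_day * semantic_rate * 30 * COST_PER_CALL[model]


# Example: 100k calls/day with 10% escalating to semantic validation on gpt-4o-mini.
print(f"${monthly_semantic_cost(100_000, 0.10, 'gpt-4o-mini'):,.2f}/month")  # $150.00
```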
## Running Your Own

```bash
# Clone the evaluation repo
git clone https://github.com/sentinel-seed/sentinel-evaluations

# Install dependencies and run the benchmarks
cd sentinel-evaluations
pip install -r requirements.txt
python run_benchmarks.py
```